Abstract

The Bayesian framework is a well-studied and successful framework for induc- 
tive reasoning, which includes hypothesis testing and confirmation, parameter 
estimation, sequence prediction, classification, and regression. But standard 
statistical guidelines for choosing the model class and prior are not always 
available or can fail, in particular in complex situations. Solomonoff com- 
pleted the Bayesian framework by providing a rigorous, unique, formal, and 
universal choice for the model class and the prior. I discuss in breadth how 
and in which sense universal (non-i.i.d.) sequence prediction solves various 
(philosophical) problems of traditional Bayesian sequence prediction. I show 
that Solomonoff's model possesses many desirable properties: Strong total 
and future bounds, and weak instantaneous bounds, and in contrast to most 
classical continuous prior densities has no zero p(oste)rior problem, i.e. can 
confirm universal hypotheses, is reparametrization and regrouping invariant, 
and avoids the old-evidence and updating problem. It even performs well 
(actually better) in non-computable environments. 
1 Introduction



"... in spite of it's incomputability, Algorithmic Probability can serve as 
a kind of 'Gold Standard' for induction systems" 

-- Ray Solomonoff (1997)

Given the weather in the past, what is the probability of rain tomorrow? What
is the correct answer in an IQ test asking to continue the sequence 1,4,9,16,? Given
historic stock-charts, can one predict the quotes of tomorrow? Assuming the sun rose
5000 years every day, how likely is doomsday (that the sun does not rise) tomorrow?
These are instances of the important problem of induction or time-series forecasting 
or sequence prediction. Finding prediction rules for every particular (new) problem 
is possible but cumbersome and prone to disagreement or contradiction. What is 
desirable is a formal general theory for prediction. 

The Bayesian framework is the most consistent and successful framework developed
thus far [Ear93, Jay03]. A Bayesian considers a set of environments=hypotheses=models
$\mathcal{M}$ which includes the true data generating probability distribution $\mu$.
From one's prior belief $w_\nu$ in environment $\nu\in\mathcal{M}$ and the observed data
sequence $x = x_1...x_n$, Bayes' rule yields one's posterior confidence in $\nu$. In a
prequential [Daw84] or transductive [Vap99, Sec. 9.1] setting, one directly determines
the predictive probability of the next symbol $x_{n+1}$ without the intermediate step
of identifying a (true or good or causal or useful) model. With the exception of 
Section HJ this paper concentrates on prediction rather than model identification. 
The ultimate goal is to make "good" predictions in the sense of maximizing one's
profit or minimizing one's loss. Note that classification and regression can be regarded
as special sequence prediction problems, where the sequence $x_1y_1...x_ny_nx_{n+1}$
of $(x,y)$-pairs is given and the class label or function value $y_{n+1}$ shall be predicted.

The Bayesian framework leaves open how to choose the model class $\mathcal{M}$ and prior
$w_\nu$. General guidelines are that $\mathcal{M}$ should be small but large enough to
contain the true environment $\mu$, and $w_\nu$ should reflect one's prior (subjective)
belief in $\nu$ or should be non-informative or neutral or objective if no prior knowledge
is available. But these are informal and ambiguous considerations outside the formal
Bayesian framework. Solomonoff's [Sol64] rigorous, essentially unique, formal, and
universal solution to this problem is to consider a single large universal class
$\mathcal{M}_U$ suitable for all induction problems. The corresponding universal prior is
biased towards simple environments in such a way that it dominates (is superior to) all
other priors. This leads to an a priori probability $M(x)$ which is equivalent to the
probability that a universal Turing machine with random input tape outputs $x$, and the
shortest program computing $x$ produces the most likely continuation (prediction) of $x$.

Many interesting, important, and deep results have been proven for Solomonoff's
universal distribution $M$ [ZL70, Sol78, Gac83, LV97, Hut01, Hut04]. The motivation
and goal of this paper is to provide a broad discussion of how and in which sense
universal sequence prediction solves all kinds of (philosophical) problems of Bayesian
sequence prediction, and to present some recent results. Many arguments and ideas
could be further developed. I hope that the exposition stimulates such a future,
more detailed, investigation.

In Section 2 I review the excellent predictive and decision-theoretic performance
results of Bayesian sequence prediction for generic (non-i.i.d.) countable and continuous
model classes. Section 3 critically reviews the classical principles (indifference,
symmetry, minimax) for obtaining objective priors, and introduces the universal prior
inspired by Occam's razor and quantified in terms of Kolmogorov complexity. In
Section 4 (for i.i.d. $\mathcal{M}$) and Section 5 (for universal $\mathcal{M}_U$) I show
various desirable properties of the universal prior and class (non-zero p(oste)rior,
confirmation of universal hypotheses, reparametrization and regrouping invariance, no
old-evidence and updating problem) in contrast to (most) classical continuous prior
densities. I also complement the general total bounds of Section 2 with some universal
and some i.i.d.-specific instantaneous and future bounds. Finally, I show that the
universal mixture performs better than classical continuous mixtures, even in
uncomputable environments. Section 6 contains critique, summary, and conclusions.

The reparametrization and regrouping invariance, the (weak) instantaneous 
bounds, the good performance of M in non-computable environments, and most 
of the discussion (zero prior and universal hypotheses, old evidence) are new or new 
in the light of universal sequence prediction. Technical and mathematical non-trivial 
new results are the Hellinger-like loss bound ([8]) and the instantaneous bounds (iMj) 
and ([nD. 

2 Bayesian Sequence Prediction 

I now formally introduce the Bayesian sequence prediction setup and describe the
most important results. I consider sequences over a finite alphabet, assume that
the true environment is unknown but known to belong to a countable or continuous
class of environments (no i.i.d. or Markov or stationarity assumption), and consider
a general prior. I show that the predictive distribution converges rapidly to the true
sampling distribution and that the Bayes-optimal predictor performs excellently for
any bounded loss function.

Notation. I use letters $t,n\in\mathbb{N}$ for natural numbers, and denote the cardinality
of a set $\mathcal{S}$ by $\#\mathcal{S}$ or $|\mathcal{S}|$. I write $\mathcal{X}^*$ for the
set of finite strings over some alphabet $\mathcal{X}$, and $\mathcal{X}^\infty$ for the set
of infinite sequences. For a string $x\in\mathcal{X}^*$ of length $\ell(x) = n$ I write
$x_1x_2...x_n$ with $x_t\in\mathcal{X}$, and further abbreviate
$x_{t:n} := x_tx_{t+1}...x_{n-1}x_n$ and $x_{<n} := x_1...x_{n-1}$.

I assume that sequence $\omega = \omega_{1:\infty}\in\mathcal{X}^\infty$ is sampled from the
"true" probability measure $\mu$, i.e. $\mu(x_{1:n}) := P[\omega_{1:n} = x_{1:n}|\mu]$ is
the $\mu$-probability that $\omega$ starts with $x_{1:n}$. I denote expectations w.r.t.
$\mu$ by $\mathbf{E}$. In particular for a function $f : \mathcal{X}^n\to\mathbb{R}$, we have
$\mathbf{E}[f] = \mathbf{E}[f(\omega_{1:n})] = \sum_{x_{1:n}}\mu(x_{1:n})f(x_{1:n})$. Note
that in Bayesian learning, measures, environments, and models coincide, and are the same
objects; let $\mathcal{M} = \{\nu_1,\nu_2,...\}$ denote a countable class of these measures.
Assume that (a) $\mu$ is unknown but known to be a member of $\mathcal{M}$,
(b) $\{H_\nu : \nu\in\mathcal{M}\}$ forms a mutually exclusive and complete class of
hypotheses, and (c) $w_\nu := P[H_\nu]$ is the given prior belief in $H_\nu$. Then
$\xi(x_{1:n}) := P[\omega_{1:n} = x_{1:n}] = \sum_{\nu\in\mathcal{M}} P[\omega_{1:n} = x_{1:n}|H_\nu]P[H_\nu]$
must be our (prior) belief in $x_{1:n}$, and
$w_\nu(x_{1:n}) := P[H_\nu|\omega_{1:n} = x_{1:n}] = \frac{P[\omega_{1:n} = x_{1:n}|H_\nu]P[H_\nu]}{P[\omega_{1:n} = x_{1:n}]}$
be our posterior belief in $\nu$ by Bayes' rule.

For a sequence $a_1,a_2,...$ of random variables, $\sum_{t=1}^\infty\mathbf{E}[a_t^2] \le c < \infty$
implies $a_t\to 0$ with $\mu$-probability 1 (w.p.1). Convergence is rapid in the sense that
the probability that $a_t^2$ exceeds $\varepsilon > 0$ at more than $\frac{c}{\varepsilon\delta}$
times $t$ is bounded by $\delta$. I sometimes loosely call this the number of errors.

Sequence prediction. Given a sequence $x_1x_2...x_{t-1}$, we want to predict its likely
continuation $x_t$. I assume that the strings which have to be continued are drawn
from a "true" probability distribution $\mu$. The maximal prior information a prediction
algorithm can possess is the exact knowledge of $\mu$, but often the true distribution is
unknown. Instead, prediction is based on a guess $\rho$ of $\mu$. While I require $\mu$ to
be a measure, I allow $\rho$ to be a semimeasure [MZllHutOllS Formally,
$\rho : \mathcal{X}^*\to[0,1]$ is a semimeasure if
$\rho(x) \ge \sum_{a\in\mathcal{X}}\rho(xa)\ \forall x\in\mathcal{X}^*$, and a (probability)
measure if equality holds and $\rho(\epsilon) = 1$, where $\epsilon$ is the empty string.
$\rho(x)$ denotes the $\rho$-probability that a sequence starts with string $x$. Further,
$\rho(a|x) := \rho(xa)/\rho(x)$ is the "posterior" or "predictive" $\rho$-probability that
the next symbol is $a\in\mathcal{X}$, given sequence $x\in\mathcal{X}^*$.

Bayes mixture. We may know or assume that $\mu$ belongs to some countable class
$\mathcal{M} := \{\nu_1,\nu_2,...\} \ni \mu$ of semimeasures. Then we can use the weighted
average on $\mathcal{M}$ (Bayes-mixture, data evidence, marginal)

$$\xi(x) := \sum_{\nu\in\mathcal{M}} w_\nu\cdot\nu(x), \qquad \sum_{\nu\in\mathcal{M}} w_\nu \le 1, \qquad w_\nu > 0 \qquad (1)$$

for prediction. One may interpret $w_\nu = P[H_\nu]$ as prior belief in $\nu$ and
$\xi(x) = P[x]$ as the subjective probability of $x$, and $\mu(x) = P[x|\mu]$ is the
sampling distribution or likelihood. The most important property of semimeasure $\xi$ is
its dominance

$$\xi(x) \ge w_\nu\nu(x) \quad \forall x \text{ and } \forall\nu\in\mathcal{M}, \quad\text{in particular}\quad \xi(x) \ge w_\mu\mu(x) \qquad (2)$$

which is a strong form of absolute continuity.
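
The mixture (1) and its dominance property (2) can be illustrated numerically. The following is a minimal sketch, assuming a small hypothetical class of three Bernoulli environments; the parameter values, the weights, and all function names are illustrative choices and do not come from the text.

```python
from itertools import product

def bernoulli_measure(theta):
    """Return nu(x): probability that an i.i.d. Bernoulli(theta) sequence starts with x."""
    def nu(x):
        p = 1.0
        for bit in x:
            p *= theta if bit == 1 else 1.0 - theta
        return p
    return nu

# Hypothetical finite class M with prior weights w_nu (summing to at most 1).
thetas = [0.1, 0.5, 0.9]
weights = [0.2, 0.5, 0.3]
model_class = [bernoulli_measure(th) for th in thetas]

def xi(x):
    """Bayes mixture (1): weighted average of all environments in the class."""
    return sum(w * nu(x) for w, nu in zip(weights, model_class))

def xi_predict(a, x):
    """Predictive probability xi(a|x) = xi(xa) / xi(x)."""
    return xi(tuple(x) + (a,)) / xi(tuple(x))

def dominance_holds(n):
    """Check the dominance property (2) on every binary string of length n."""
    return all(
        xi(x) >= w * nu(x)
        for x in product((0, 1), repeat=n)
        for w, nu in zip(weights, model_class)
    )
```

Dominance holds here simply because every term of the sum in (1) is non-negative, so dropping all but one term can only decrease the value.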

Convergence for deterministic environments. In the predictive setting we are
not interested in identifying the true environment, but in predicting the next symbol
well. Let us consider deterministic $\mu$ first. An environment is called deterministic if
$\mu(\alpha_{1:n}) = 1\ \forall n$ for some sequence $\alpha$, and $\mu = 0$ elsewhere
(off-sequence). In this case we identify $\mu$ with $\alpha$ and the following holds:

$$\sum_{t=1}^\infty |1-\xi(\alpha_t|\alpha_{<t})| \le \ln w_\alpha^{-1} \quad\text{and}\quad \xi(\alpha_{t:n}|\alpha_{<t}) \to 1 \ \text{ for } n \ge t \to \infty \qquad (3)$$



^1 Readers unfamiliar or uneasy with semimeasures can ignore this technicality without loss.
where $w_\alpha > 0$ is the weight of $\alpha \,\hat=\, \mu\in\mathcal{M}$. This shows that
$\xi(\alpha_t|\alpha_{<t})$ rapidly converges to 1 and hence also
$\xi(\bar\alpha_t|\alpha_{<t}) \to 0$ for $\bar\alpha_t \ne \alpha_t$, and that $\xi$ is also
a good multi-step lookahead predictor. Proof: $\xi(\alpha_{1:n}) \to c > 0$, since
$\xi(\alpha_{1:n})$ is monotone decreasing in $n$ and
$\xi(\alpha_{1:n}) \ge w_\mu\mu(\alpha_{1:n}) = w_\mu > 0$. Hence
$\xi(\alpha_{1:n})/\xi(\alpha_{<t}) \to c/c = 1$ for any limit sequence $t,n\to\infty$. The
bound follows from
$\sum_{t=1}^n 1-\xi(x_t|x_{<t}) \le -\sum_{t=1}^n \ln\xi(x_t|x_{<t}) = -\ln\xi(x_{1:n})$
and $\xi(\alpha_{1:n}) \ge w_\alpha$.
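
The bound (3) can be checked on a toy class. The sketch below assumes a hypothetical class of ten deterministic periodic sequences with weights $2^{-k}$; the class, the weights, and the helper names are illustrative assumptions, not part of the original setup.

```python
import math

def periodic(k, n):
    """First n symbols of the deterministic sequence with a 1 at every k-th position."""
    return tuple(1 if t % k == 0 else 0 for t in range(1, n + 1))

K = 10                                              # number of hypothetical environments
weights = {k: 2.0 ** -k for k in range(1, K + 1)}   # prior weights, summing to less than 1

def xi(x):
    """Mixture probability that a sequence starts with x: total weight of consistent environments."""
    n = len(x)
    return sum(w for k, w in weights.items() if periodic(k, n) == tuple(x))

def cumulative_error(true_k, n):
    """Sum over t <= n of |1 - xi(alpha_t | alpha_<t)| along the true sequence alpha."""
    alpha = periodic(true_k, n)
    total = 0.0
    for t in range(1, n + 1):
        total += abs(1.0 - xi(alpha[:t]) / xi(alpha[:t - 1]))
    return total

true_k = 5
err = cumulative_error(true_k, 200)
bound = math.log(1.0 / weights[true_k])   # ln(1/w_alpha), the right-hand side of (3)
```

In this example all prediction errors occur within the first five symbols; once every competing sequence has been falsified, the mixture predicts the true continuation with probability 1.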

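Convergence in probabilistic environments. In the general probabilistic case
we want to know how close $\xi(x_t|x_{<t})$ is to the true probability $\mu(x_t|x_{<t})$.
One convenient distance measure is the (squared) Hellinger distance

$$h_t(\omega_{<t}) := \sum_{a\in\mathcal{X}} \left(\sqrt{\xi(a|\omega_{<t})} - \sqrt{\mu(a|\omega_{<t})}\right)^2 \qquad (4)$$

One can show [Hut03a, Hut04] that

$$\sum_{t=1}^n \mathbf{E}\left[\left(\sqrt{\frac{\xi(\omega_t|\omega_{<t})}{\mu(\omega_t|\omega_{<t})}} - 1\right)^2\right] \le \sum_{t=1}^n \mathbf{E}[h_t] \le D_n(\mu\|\xi) := \mathbf{E}\left[\ln\frac{\mu(\omega_{1:n})}{\xi(\omega_{1:n})}\right] \le \ln w_\mu^{-1} \qquad (5)$$

The first two inequalities actually hold for any two (semi)measures, and the last
inequality follows from (2). These bounds (with $n = \infty$) imply $h_t \to 0$ and hence
$\xi(x_t|\omega_{<t}) - \mu(x_t|\omega_{<t}) \to 0$ for any $x_t$ and
$\frac{\xi(\omega_t|\omega_{<t})}{\mu(\omega_t|\omega_{<t})} \to 1$, both rapid w.p.1 for
$t \to \infty$. An improved bound
$\mathbf{E}[\exp(\frac12\sum_t h_t)] \le w_\mu^{-1/2}$ [HM04] even shows that the probability
that $\sum_t h_t$ additively exceeds $\ln w_\mu^{-1}$ by $c$ (e.g. $c > 10$) is tiny:
at most $e^{-c/2}$. One can also show multi-step lookahead convergence
$\xi(x_{t:n_t}|\omega_{<t}) - \mu(x_{t:n_t}|\omega_{<t}) \to 0$ (even for unbounded
horizon $1 \le n_t - t + 1 \to \infty$), which is interesting for delayed sequence
prediction and in reactive environments [Hut04]. Since $\xi$ rapidly converges to $\mu$,
one can anticipate that also decisions based on $\xi$ are good.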
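
For a small finite class the chain of inequalities (5) can be verified exactly by enumerating all histories. The sketch below assumes a hypothetical class of three Bernoulli environments with the third one playing the role of $\mu$; the numbers and names are illustrative assumptions only.

```python
import math
from itertools import product

thetas = [0.2, 0.5, 0.7]          # hypothetical class of Bernoulli environments
weights = [0.25, 0.25, 0.5]       # prior weights w_nu
true_idx = 2                      # mu is Bernoulli(0.7), with weight w_mu = 0.5

def nu(theta, x):
    """Probability that an i.i.d. Bernoulli(theta) sequence starts with x."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)

def xi(x):
    return sum(w * nu(th, x) for w, th in zip(weights, thetas))

def mu(x):
    return nu(thetas[true_idx], x)

def hellinger(past):
    """Squared Hellinger distance (4) between the xi- and mu-predictions after `past`."""
    h = 0.0
    for a in (0, 1):
        p_xi = xi(past + (a,)) / xi(past)
        p_mu = mu(past + (a,)) / mu(past)
        h += (math.sqrt(p_xi) - math.sqrt(p_mu)) ** 2
    return h

def expected_hellinger_sum(n):
    """Sum over t <= n of E[h_t], computed exactly by enumerating all histories."""
    total = 0.0
    for t in range(1, n + 1):
        for past in product((0, 1), repeat=t - 1):
            total += mu(past) * hellinger(past)
    return total

def relative_entropy(n):
    """D_n(mu||xi) = E[ln(mu(x_1:n) / xi(x_1:n))], computed exactly."""
    return sum(mu(x) * math.log(mu(x) / xi(x)) for x in product((0, 1), repeat=n))

n = 10
h_sum = expected_hellinger_sum(n)
d_n = relative_entropy(n)
bound = math.log(1.0 / weights[true_idx])   # ln(1/w_mu)
```

The exhaustive enumeration is exponential in $n$ and only meant as a sanity check of the ordering $\sum_t \mathbf{E}[h_t] \le D_n \le \ln w_\mu^{-1}$ for short sequences.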

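Bayesian decisions. Let $\ell_{x_ty_t} \in [0,1]$ be the received loss when predicting
$y_t\in\mathcal{Y}$, but $x_t\in\mathcal{X}$ turns out to be the true $t$-th symbol of the
sequence. The $\rho$-optimal predictor

$$y_t^{\Lambda_\rho}(\omega_{<t}) := \arg\min_{y_t} \sum_{x_t} \rho(x_t|\omega_{<t})\,\ell_{x_ty_t} \qquad (6)$$

minimizes the $\rho$-expected loss. For instance for $\mathcal{X} = \mathcal{Y} = \{0,1\}$,
$\Lambda_\rho$ is a threshold strategy with $y_t^{\Lambda_\rho} = 0/1$ for
$\rho(1|\omega_{<t}) \lessgtr \gamma$, where
$\gamma := \frac{\ell_{01}-\ell_{00}}{\ell_{01}-\ell_{00}+\ell_{10}-\ell_{11}}$. The
instantaneous loss at time $t$ and the total $\mu$(=true)-expected loss for the first $n$
symbols are

$$l_t^{\Lambda_\rho}(\omega_{<t}) := \mathbf{E}[\ell_{\omega_ty_t^{\Lambda_\rho}}|\omega_{<t}] \quad\text{and}\quad L_n^{\Lambda_\rho} := \sum_{t=1}^n \mathbf{E}[l_t^{\Lambda_\rho}(\omega_{<t})] \qquad (7)$$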

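Let $\Lambda$ be any prediction scheme (deterministic or probabilistic) with no constraint
at all, taking any action $y_t^\Lambda\in\mathcal{Y}$ with total expected loss $L_n^\Lambda$.
If $\mu$ is known, $\Lambda_\mu$ is obviously the best prediction scheme in the sense of
achieving minimal expected loss $L_n^{\Lambda_\mu} \le L_n^\Lambda$ for any $\Lambda$. For
the predictor $\Lambda_\xi$ based on the Bayes mixture $\xi$, one can show (proof in
Appendix A; see also [MF98, Hut03a] for related bounds)

$$\left(\sqrt{L_n^{\Lambda_\xi}} - \sqrt{L_n^{\Lambda_\mu}}\right)^2 \le \sum_{t=1}^n \mathbf{E}\left[\left(\sqrt{l_t^{\Lambda_\xi}} - \sqrt{l_t^{\Lambda_\mu}}\right)^2\right] \le 2\sum_{t=1}^n \mathbf{E}[h_t] \qquad (8)$$

which actually holds for any two (semi)measures. Chaining with (5) implies,
for instance, $\sqrt{l_t^{\Lambda_\xi}} - \sqrt{l_t^{\Lambda_\mu}} \to 0$ rapid w.p.1,
$\sqrt{L_n^{\Lambda_\xi}}$ exceeds $\sqrt{L_n^{\Lambda_\mu}}$ by at most
$\sqrt{2\ln w_\mu^{-1}}$, $L_n^{\Lambda_\xi}/L_n^{\Lambda_\mu} \to 1$ for
$L_n^{\Lambda_\mu} \to \infty$, or if $L_\infty^{\Lambda_\mu}$ is finite, then also
$L_\infty^{\Lambda_\xi}$. This shows that $\xi$ (via $\Lambda_\xi$) also performs excellently
from a decision-theoretic perspective, i.e. suffers loss only slightly larger than the
optimal $\Lambda_\mu$ predictor.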

One can also show that $\Lambda_\xi$ is Pareto-optimal (admissible) in the sense that every
other predictor with smaller loss than $\Lambda_\xi$ in some environment
$\nu\in\mathcal{M}$ must be worse in another environment [Hut03c].
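
The binary case of the $\rho$-optimal predictor (6) and its equivalent threshold form can be sketched as follows. The loss matrix is a hypothetical example, `loss[x][y]` stands for $\ell_{xy}$, and the threshold formula assumes that its denominator is positive.

```python
def optimal_prediction(p1, loss):
    """rho-optimal action (6) for binary X = Y = {0, 1}.

    p1 is rho(1|history); loss[x][y] is the loss for predicting y when x occurs.
    """
    expected = [(1.0 - p1) * loss[0][y] + p1 * loss[1][y] for y in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1

def threshold(loss):
    """Threshold gamma of the equivalent threshold strategy (assumes a positive denominator)."""
    num = loss[0][1] - loss[0][0]
    return num / (num + loss[1][0] - loss[1][1])

def threshold_prediction(p1, loss):
    """Predict 1 exactly when rho(1|history) exceeds gamma."""
    return 1 if p1 > threshold(loss) else 0

# Hypothetical asymmetric loss: missing a 1 costs 1.0, a false alarm costs 0.25.
loss = [[0.0, 0.25],
        [1.0, 0.0]]
```

With this loss matrix the threshold is $\gamma = 0.2$, so the predictor already outputs 1 when a 1 is only moderately likely, reflecting the higher cost of missing it.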

Continuous environmental classes. I will argue later that countable $\mathcal{M}$ are
sufficiently large from a philosophical and computational perspective. On the other
hand, countable $\mathcal{M}$ exclude all continuously parameterized families (like the
class of all i.i.d. or Markov processes), common in statistical practice. I show that the
bounds above remain approximately valid for most parametric model classes. Let

$$\mathcal{M} := \{\nu_\theta : \theta\in\Theta\subseteq\mathbb{R}^d\}$$

be a family of probability distributions parameterized by a $d$-dimensional continuous
parameter $\theta$, and $\mu \equiv \nu_{\theta_0}\in\mathcal{M}$ the true generating
distribution. For a continuous weight density^2 $w(\theta) > 0$ the sums (1) are naturally
replaced by integrals:

$$\xi(x_{1:n}) := \int_\Theta w(\theta)\cdot\nu_\theta(x_{1:n})\,d\theta, \qquad \int_\Theta w(\theta)\,d\theta = 1 \qquad (9)$$

The most important property of $\xi$ was the dominance (2) achieved by dropping the
sum over $\nu$. The analogous construction here is to restrict the integral over $\theta$ to
a small vicinity of $\theta_0$. Since a continuous parameter can typically be estimated to
accuracy $\propto n^{-1/2}$ after $n$ observations, the largest volume in which $\nu_\theta$
as a function of $\theta$ is approximately flat is $\propto (n^{-1/2})^d$, hence
$\xi(x_{1:n}) \gtrsim n^{-d/2}w(\theta_0)\mu(x_{1:n})$. Under some weak regularity
conditions one can prove [CB90, Hut03c]

$$D_n(\mu\|\xi) := \mathbf{E}\ln\frac{\mu(\omega_{1:n})}{\xi(\omega_{1:n})} \le \ln w(\theta_0)^{-1} + \frac{d}{2}\ln\frac{n}{2\pi} + \frac12\ln\det\bar\jmath_n(\theta_0) + o(1) \qquad (10)$$

where $w(\theta_0)$ is the weight density (9) of $\mu$ in $\xi$, and $o(1)$ tends to zero for
$n\to\infty$, and the average Fisher information matrix
$\bar\jmath_n(\theta) = -\frac1n\mathbf{E}[\nabla_\theta\nabla_\theta^T\ln\nu_\theta(\omega_{1:n})]$
measures the local smoothness of $\nu_\theta$ and is bounded for many reasonable classes,
including all stationary ($k$-th order) finite-state Markov processes. See Section 4 for an
application to the i.i.d. ($k = 0$) case. We see that in the continuous case, $D_n$ is no
longer bounded by a constant, but grows very slowly (logarithmically) with $n$, which still
implies that $\varepsilon$-deviations are exponentially seldom. Hence, (10) allows one to
bound (5) and (8) even in case of continuous $\mathcal{M}$.

^2 $w()$ will always denote densities, and $w_{()}$ probabilities.
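
The logarithmic growth of $D_n$ can be made concrete for the Bernoulli class ($d = 1$) with uniform prior density, where the mixture has the closed form $\xi(x) = n_1!\,n_0!/(n+1)!$ and the Fisher information is $\frac1\theta + \frac1{1-\theta}$. The sketch below computes $D_n$ exactly and compares it with the right-hand side of (10) without the $o(1)$ term; the function names and parameter values are illustrative assumptions.

```python
import math

def log_binom(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def relative_entropy(theta, n):
    """D_n(mu||xi) for mu = Bernoulli(theta) and xi the uniform-prior mixture.

    Uses xi(x) = n1! n0! / (n+1)!, which depends on x only through the number of ones n1.
    """
    d = 0.0
    for k in range(n + 1):
        log_mu = k * math.log(theta) + (n - k) * math.log(1.0 - theta)
        log_xi = -log_binom(n, k) - math.log(n + 1)
        prob_k = math.exp(log_binom(n, k) + log_mu)   # probability of seeing k ones
        d += prob_k * (log_mu - log_xi)
    return d

def asymptotic_bound(theta, n, d=1, prior_density=1.0):
    """Right-hand side of (10) without the o(1) term, for the Bernoulli class."""
    fisher = 1.0 / theta + 1.0 / (1.0 - theta)
    return (math.log(1.0 / prior_density)
            + 0.5 * d * math.log(n / (2.0 * math.pi))
            + 0.5 * math.log(fisher))
```

For small $n$ the neglected $o(1)$ term matters, so the comparison is only meaningful for moderately large $n$, for example `relative_entropy(0.3, 500)` against `asymptotic_bound(0.3, 500)`.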

3 How to Choose the Prior 

I showed in the last section how to predict if the true environment $\mu$ is unknown, but
known to belong to some class $\mathcal{M}$ of environments. In this section, I assume
$\mathcal{M}$ to be given, and discuss how to (universally) choose the prior $w_\nu$. After
reviewing various classical principles (indifference, symmetry, minimax) for obtaining
objective priors for "small" $\mathcal{M}$, I discuss large $\mathcal{M}$. Occam's razor in
conjunction with Epicurus' principle of multiple explanations, quantified by Kolmogorov
complexity, leads us to a universal prior, which results in a better predictor than any
other prior over countable $\mathcal{M}$.

Classical principles. The probability axioms (implying Bayes' rule) allow one to compute
posteriors and predictive distributions from prior ones, but are mute about how
to choose the prior. Much has been written on the choice of priors (see [KW96] for
a survey and references). A main classification is between objective and subjective
priors. An objective prior is a prior constructed based on some rational principles,
which ideally everyone without (relevant) extra prior knowledge should adopt.
In contrast, a subjective prior aims at modelling the agent's personal (subjective)
experience or knowledge (e.g. of related phenomena). In Section El I show that one 
way to arrive at a subjective prior is to start with an objective prior, make all past 
personal experience explicit, determine a "posterior" and use it as subjective prior. 
So I concentrate in the following on the more important objective priors. 

Consider a very simple case of two environments, e.g. a biased coin with head
or tail probability 1/3. In absence of any extra knowledge (which I henceforth
assume) there is no reason to prefer head probability $\theta = 1/3$ over $\theta = 2/3$ and
vice versa, leaving $w_{1/3} = w_{2/3} = \frac12$ as the only rational choice. More generally,
for finite $\mathcal{M}$, the symmetry or indifference argument [Lap12] suggests to set
$w_\nu = \frac{1}{|\mathcal{M}|}\ \forall\nu\in\mathcal{M}$. Similarly for a compact
measurable parameter space $\Theta$ we may choose a uniform density
$w(\theta) = [\mathrm{Vol}(\Theta)]^{-1}$. But there is a problem: If we go to a different
parametrization (e.g. $\theta \leadsto \theta' := \sqrt\theta$ in the Bernoulli case), the
prior $w(\theta) \leadsto w'(\theta')$ becomes non-uniform. Jeffreys' [Jef46] solution is to
find a symmetry group of the problem (like permutations for finite $\mathcal{M}$) and
require the prior to be invariant under group transformations. For instance, if
$\theta\in\mathbb{R}$ is a location parameter (e.g. the mean) it is natural to require a
translation-invariant prior. Problems are that there may be no obvious symmetry, the
resulting prior may be improper (like for the translation group), and the result can
depend on which parameters are treated as nuisance parameters.
The maximum entropy principle extends the symmetry principle by allowing
certain types of constraints on the parameters. Conjugate priors are classes of priors 
such that the posteriors are themselves again in the class. While this can lead to 
interesting classes, the principle itself is not selective, since e.g. the class of all priors 
forms a conjugate class. 

Another minimax approach by Bernardo [Ber79, CB90] is to consider bound (10),
which can actually be improved within $o(1)$ to an equality. Since we want $D_n$ to
be small, we minimize the r.h.s. for the worst $\mu\in\mathcal{M}$. Choice
$w(\theta) \propto \sqrt{\det\bar\jmath_n(\theta)}$ equalizes and hence minimizes (10). The
problems are the same as for Jeffreys' prior (actually often both priors coincide), and
also the dependence on the model class and potentially on $n$.

The principles above, although not unproblematic, can provide good objective 
priors in many cases of small discrete or compact spaces, but we will meet some more 
problems later. For "large" model classes I am interested in, i.e. countably infinite, 
non-compact, or non-parametric spaces, the principles typically do not apply or 
break down. 

Occam's razor et al. Machine learning, the computer science branch of statistics,
often deals with very large model classes. Naturally, machine learning has
(re)discovered and exploited quite different principles for choosing priors, appropriate
for this situation. The overarching principles put together by Solomonoff [Sol64]
are: Occam's razor (choose the simplest model consistent with the data), Epicurus'
principle of multiple explanations (keep all explanations consistent with the data),
(Universal) Turing machines (to compute, quantify and assign codes to all quantities
of interest), and Kolmogorov complexity (to define what simplicity/complexity means).

I will first "derive" the so-called universal prior, and subsequently justify it by
presenting various welcome theoretical properties and by examples. The idea is that
a priori, i.e. before seeing the data, all models are "consistent," so a priori Epicurus
would regard all models (in $\mathcal{M}$) as possible, i.e. choose
$w_\nu > 0\ \forall\nu\in\mathcal{M}$. In order to also do (some) justice to Occam's razor
we should prefer simple hypotheses, i.e. assign high (low) prior $w_\nu$ to simple
(complex) hypotheses $H_\nu$. Before I can define this prior, I need to quantify the
notion of complexity.

Notation. A function $f : \mathcal{S}\to\mathbb{R}\cup\{\pm\infty\}$ is said to be lower
semi-computable (or enumerable) if the set
$\{(x,y) : y < f(x),\ x\in\mathcal{S},\ y\in\mathbb{Q}\}$ is recursively enumerable. $f$
is upper semi-computable (or co-enumerable) if $-f$ is enumerable. $f$ is computable
(or recursive) if $f$ and $-f$ are enumerable. The set of (co)enumerable functions
is recursively enumerable. I write $O(1)$ for a constant of reasonable size: For
instance, a sequence of length 100 is reasonable, maybe even $2^{30}$, but $2^{500}$ is not. I
write $f(x) \stackrel{+}{\le} g(x)$ for $f(x) \le g(x) + O(1)$ and
$f(x) \stackrel{\times}{\le} g(x)$ for $f(x) \le 2^{O(1)}\cdot g(x)$. Corresponding
equalities hold if the inequalities hold in both directions.^3 We say that
a property $A(n)\in\{true,false\}$ holds for most $n$, if
$\#\{t \le n : A(t)\}/n \to 1$ for $n\to\infty$.

^3 I will ignore these additive and multiplicative fudges in the discussion till Section [S]
Kolmogorov complexity. We can now quantify the complexity of a string. Intuitively,
a string is simple if it can be described in a few words, like "the string of one
million ones", and is complex if there is no such short description, like for a random
object whose shortest description is specifying it bit by bit. We are interested in
effective descriptions, and hence restrict decoders to be Turing machines (TMs).
Let us choose some universal (so-called prefix) Turing machine $U$ with binary
input=program tape, $|\mathcal{X}|$-ary output tape, and bidirectional work tape. We can
then define the prefix Kolmogorov complexity [Cha75, Gac74, Kol65, Lev74] of string $x$ as
the length $\ell$ of the shortest binary program $p$ for which $U$ outputs $x$:

$$K(x) := \min_p\{\ell(p) : U(p) = x\}$$

Simple strings like 000...0 can be generated by short programs, and hence have low
Kolmogorov complexity, but irregular (e.g. random) strings are their own shortest
description, and hence have high Kolmogorov complexity. For non-string objects $o$
(like numbers and functions) we define $K(o) := K(\langle o\rangle)$, where
$\langle o\rangle\in\mathcal{X}^*$ is some standard code for $o$. In particular, if
$(f_i)_{i=1}^\infty$ is an enumeration of all (co)enumerable functions, we define
$K(f_i) = K(i)$.

An important property of $K$ is that it is nearly independent of the choice of
$U$. More precisely, if we switch from one universal TM to another, $K(x)$ changes
at most by an additive constant independent of $x$. For natural universal TMs, the
compiler constant is of reasonable size $O(1)$. A defining property of
$K : \mathcal{X}^*\to\mathbb{N}$ is that it additively dominates all co-enumerable functions
$f : \mathcal{X}^*\to\mathbb{N}$ that satisfy Kraft's inequality $\sum_x 2^{-f(x)} \le 1$,
i.e. $K(x) \stackrel{+}{\le} f(x)$ for $K(f) = O(1)$. The universal TM provides a shorter
prefix code than any other effective prefix code. $K$ shares many properties with
Shannon's entropy (information measure) $S$, but $K$ is superior to $S$ in many respects.
To be brief, $K$ is an excellent universal complexity measure, suitable for quantifying
Occam's razor. We need the following properties of $K$:

a) $K$ is not computable, but only upper semi-computable,

b) the upper bound $K(n) \stackrel{+}{\le} \log_2 n + 2\log_2\log n$,   (11)

c) Kraft's inequality $\sum_x 2^{-K(x)} \le 1$, which implies $2^{-K(n)} \le \frac1n$ for most $n$,

d) information non-increase $K(f(x)) \stackrel{+}{\le} K(x) + K(f)$ for recursive $f : \mathcal{X}^*\to\mathcal{X}^*$,

e) $K(x) \stackrel{+}{\le} -\log_2 P(x) + K(P)$ if $P : \mathcal{X}^*\to[0,1]$ is enumerable and $\sum_x P(x) \le 1$,

f) $\sum_{x : f(x)=y} 2^{-K(x)} \stackrel{\times}{=} 2^{-K(y)}$ if $f$ is recursive and $K(f) = O(1)$.

The proof of (f) can be found in Appendix A and the proofs of (a)-(e) in [LV97].
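
Since $K$ itself is not computable (property a), any concrete illustration has to use a computable stand-in. The sketch below assumes the Elias gamma code, an effective prefix code for the natural numbers with code length $2\lfloor\log_2 n\rfloor + 1$; this choice is an illustrative assumption, and its length is cruder than the bound (11b). It checks Kraft's inequality for these code lengths and turns them into prior-like weights.

```python
def elias_gamma_length(n):
    """Length in bits of the Elias gamma code of n >= 1, a computable prefix code."""
    return 2 * (n.bit_length() - 1) + 1

def kraft_sum(limit):
    """Sum of 2^(-code length) over 1 <= n < limit; at most 1 for any prefix code."""
    return sum(2.0 ** -elias_gamma_length(n) for n in range(1, limit))

def stand_in_prior(n):
    """Weight 2^(-code length): a computable stand-in for 2^(-K(n)), not K itself."""
    return 2.0 ** -elias_gamma_length(n)
```

Because $K$ additively dominates every effective prefix code length, the true weights $2^{-K(n)}$ are, up to a multiplicative constant, at least as large as these stand-in weights.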

The universal prior. We can now quantify a prior biased towards simple models.
First, we quantify the complexity of an environment $\nu$ or hypothesis $H_\nu$ by its
Kolmogorov complexity $K(\nu)$. The universal prior should be a decreasing function
in the model's complexity, and of course sum to (less than) one. Since $K$ satisfies
Kraft's inequality (11c), this suggests the following choice:

$$w_\nu = w_\nu^U := 2^{-K(\nu)} \qquad (12)$$

For this choice, the bound (5) on $D_n$ (which in turn bounds the Hellinger sum in (5)
and the loss difference in (8)) reads

$$\sum_{t=1}^\infty \mathbf{E}[h_t] \le D_\infty \le K(\mu)\ln 2 \qquad (13)$$

i.e. the number of times $\xi$ deviates from $\mu$, or $l^{\Lambda_\xi}$ deviates from
$l^{\Lambda_\mu}$, by more than $\varepsilon > 0$ is bounded by $O(K(\mu))$, i.e. is
proportional to the complexity of the environment. Could other choices for $w_\nu$ lead to
better bounds? The answer is essentially no [Hut04]: Consider any other reasonable prior
$w'_\nu$, where reasonable means (lower semi)computable with a program of size $O(1)$.
Then, MDL bound (11e) with $P() \leadsto w'_{()}$ and $x \leadsto \langle\mu\rangle$ shows
$K(\mu) \stackrel{+}{\le} -\log_2 w'_\mu + K(w'_{()})$, hence
$\ln w'^{-1}_\mu \stackrel{+}{\ge} K(\mu)\ln 2$ leads (within an additive constant) to a
weaker bound. A counting argument also shows that $O(K(\mu))$ errors for most $\mu$ are
unavoidable. So this choice of prior leads to very good prediction.

Even for continuous classes $\mathcal{M}$, we can assign a (proper) universal prior (not
density) $w_\theta^U = 2^{-K(\theta)} > 0$ for computable $\theta$, and 0 for uncomputable
ones. This effectively reduces $\mathcal{M}$ to a discrete class
$\{\nu_\theta\in\mathcal{M} : w_\theta^U > 0\}$ which is typically dense in $\mathcal{M}$.
We will see that this prior has many advantages over the classical prior densities.



4 Independent Identically Distributed Data 

I now compare the classical continuous prior densities to the universal prior on classes 
of i.i.d. environments. I present some standard critiques to the former, illustrated 
on Bayes-Laplace's classical Bernoulli class with uniform prior: the problem of zero 
p(oste)rior, non-confirmation of universal hypotheses, and reparametrization and 
regrouping non-invariance. I show that the universal prior does not suffer from 
these problems. Finally I complement the general total bounds of Section 2 with
some i.i.d.-specific instantaneous bounds.

Laplace's rule for Bernoulli sequences. Let $x = x_1x_2...x_n\in\mathcal{X}^n = \{0,1\}^n$ be
generated by a biased coin with head=1 probability $\theta\in[0,1]$, i.e. the likelihood of
$x$ under hypothesis $H_\theta$ is
$\nu_\theta(x) = P[x|H_\theta] = \theta^{n_1}(1-\theta)^{n_0}$, where
$n_1 = x_1 + ... + x_n = n - n_0$. Bayes [Bay63] assumed a uniform prior density
$w(\theta) = 1$. The evidence is
$\xi(x) = \int_0^1 \nu_\theta(x)w(\theta)\,d\theta = \frac{n_1!\,n_0!}{(n+1)!}$ and the
posterior probability weight density
$w(\theta|x) = \nu_\theta(x)w(\theta)/\xi(x) = \frac{(n+1)!}{n_1!\,n_0!}\theta^{n_1}(1-\theta)^{n_0}$
of $\theta$ after seeing $x$ is strongly peaked around the frequency estimate
$\hat\theta = \frac{n_1}{n}$ for large $n$. Laplace [Lap12] asked for the predictive
probability $\xi(1|x)$ of observing $x_{n+1} = 1$ after having seen $x = x_1...x_n$, which is
$\xi(1|x) = \frac{\xi(x1)}{\xi(x)} = \frac{n_1+1}{n+2}$. (Laplace believed that the sun had
risen for 5000 years = 1826213 days since creation, so he concluded that the probability
of doom, i.e. that the sun won't rise tomorrow, is $\frac{1}{1826215}$.) This looks like a
reasonable estimate, since it is close to the relative frequency, asymptotically
consistent, symmetric, even defined for $n = 0$, and not overconfident (never assigns
probability 1).
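
Laplace's rule follows directly from the closed form of the evidence, which the following sketch reproduces with exact rational arithmetic. The function names are illustrative; the sunrise count is the one quoted above.

```python
from fractions import Fraction
from math import factorial

def evidence(n1, n0):
    """xi(x) = n1! n0! / (n+1)! for a binary string with n1 ones and n0 zeros."""
    return Fraction(factorial(n1) * factorial(n0), factorial(n1 + n0 + 1))

def laplace_rule(n1, n0):
    """Predictive probability xi(1|x) = xi(x1) / xi(x), which equals (n1+1)/(n+2)."""
    return evidence(n1 + 1, n0) / evidence(n1, n0)

def doom_probability(days):
    """Probability that the sun fails to rise after `days` consecutive sunrises."""
    return Fraction(1, days + 2)   # closed form of 1 - laplace_rule(days, 0)
```

For example, `laplace_rule(0, 0)` gives 1/2 before any observation, and `doom_probability(1826213)` gives Laplace's 1/1826215.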

The problem of zero prior. But also Laplace's rule is not without problems. The
appropriateness of the uniform prior has been questioned in Section 3 and will be
detailed below. Here I discuss a version of the zero prior problem. If the prior is zero,
then the posterior is necessarily also zero. The above example seems unproblematic,
since the prior and posterior densities $w(\theta)$ and $w(\theta|x)$ are non-zero.
Nevertheless it is problematic e.g. in the context of scientific confirmation theory [Ear93].

Consider the hypothesis $H$ that all balls in some urn, or all ravens, are black
(=1). A natural model is to assume that balls (or ravens) are drawn randomly from
an infinite population with fraction $\theta$ of black balls (or ravens) and to assume a
uniform prior over $\theta$, i.e. just the Bayes-Laplace model. Now we draw $n$ objects and
observe that they are all black.

We may formalize $H$ as the hypothesis $H' := \{\theta = 1\}$. Although the posterior
probability of the relaxed hypothesis $H_\varepsilon := \{\theta \ge 1-\varepsilon\}$,
$P[H_\varepsilon|1^n] = \int_{1-\varepsilon}^1 w(\theta|1^n)\,d\theta = \int_{1-\varepsilon}^1 (n+1)\theta^n\,d\theta = 1-(1-\varepsilon)^{n+1}$
tends to 1 for $n\to\infty$ for every fixed $\varepsilon > 0$,
$P[H'|1^n] = P[H_0|1^n]$ remains identically zero, i.e. no amount of evidence can confirm
$H'$. The reason is simply that zero prior $P[H'] = 0$ implies zero posterior.

Note that $H'$ refers to the unobservable quantity $\theta$ and only demands blackness
with probability 1. So maybe a better formalization of $H$ is purely in terms of
observational quantities: $H'' := \{\omega_{1:\infty} = 1^\infty\}$. Since
$\xi(1^n) = \frac{1}{n+1}$, the predictive probability of observing $k$ further black
objects is $\xi(1^k|1^n) = \frac{\xi(1^{n+k})}{\xi(1^n)} = \frac{n+1}{n+k+1}$. While for
fixed $k$ this tends to 1, $P[H''|1^n] = \lim_{k\to\infty}\xi(1^k|1^n) \equiv 0\ \forall n$,
as for $H'$.

One may speculate that the crux is the infinite population. But for a finite
population of size $N$ and sampling with (similarly without) repetition,
$P[H''|1^n] = \xi(1^{N-n}|1^n) = \frac{n+1}{N+1}$ is close to one only if a large fraction
of objects has been observed. This contradicts scientific practice: Although only a tiny
fraction of all existing ravens have been observed, we regard this as sufficient evidence
for believing strongly in $H$. This quantifies [Mah04, Thm.11] and shows that Maher does
not solve the problem of confirmation of universal hypotheses.
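
The failure of the uniform prior to confirm a universal hypothesis can be seen from the two closed forms above. The sketch below evaluates them exactly; the function names and the sample numbers in the usage note are illustrative.

```python
from fractions import Fraction

def further_black(k, n):
    """xi(1^k | 1^n) = (n+1)/(n+k+1) under the uniform prior."""
    return Fraction(n + 1, n + k + 1)

def finite_population_confirmation(n, population):
    """P[H''|1^n] = (n+1)/(N+1) when all N objects must be black and n were observed."""
    return further_black(population - n, n)
```

After 1000 black observations, ten further black objects are predicted with probability above 0.99, yet the probability that all of one million objects are black is only about 0.001.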

There are two solutions of this problem: We may abandon strict/logical/all-quantified/
universal hypotheses altogether in favor of soft hypotheses like $H_\varepsilon$. Although
not unreasonable, this approach is unattractive for several reasons. The other
solution is to assign a non-zero prior to $\theta = 1$. Consider, for instance, the improper
density $w(\theta) = \frac12[1+\delta(1-\theta)]$, where $\delta$ is the Dirac-delta
($\int f(\theta)\delta(\theta-a)\,d\theta = f(a)$), or equivalently
$P[\theta \ge a] = 1-\frac12 a$. We get
$\xi(x_{1:n}) = \frac12\left[\frac{n_1!\,n_0!}{(n+1)!} + \delta_{0n_0}\right]$, where
$\delta_{ij}$ (equal to 1 if $i = j$ and 0 otherwise) is Kronecker's $\delta$. In particular
$\xi(1^n) = \frac12\frac{n+2}{n+1}$ is much larger than for uniform prior. Since
$\xi(1^k|1^n) = \frac{n+k+2}{n+k+1}\cdot\frac{n+1}{n+2}$ we get
$P[H''|1^n] = \lim_{k\to\infty}\xi(1^k|1^n) = \frac{n+1}{n+2} \to 1$, i.e. $H''$ gets
strongly confirmed by observing a reasonable number of black objects. This correct
asymptotics also follows from the general result (3). Confirmation of $H''$ is also
reflected in the fact that $\xi(0|1^n) = \frac{1}{(n+2)^2}$ tends much faster to zero than
for uniform prior, i.e. the confidence that the next object is black is much higher. The
power actually depends on the shape of $w(\theta)$ around $\theta = 1$. Similarly $H'$ gets
confirmed: $P[H'|1^n] = \mu_1(1^n)P[\theta = 1]/\xi(1^n) = \frac{n+1}{n+2} \to 1$. On the
other hand, if a single (or more) 0 is observed ($n_0 > 0$), then the predictive
distribution $\xi(\cdot|x)$ and posterior $w(\theta|x)$ are the same as for uniform prior.
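
The effect of the point mass at $\theta = 1$ can be reproduced with exact arithmetic. The sketch below implements the evidence for the mixed prior $w(\theta) = \frac12[1+\delta(1-\theta)]$ from the paragraph above; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def evidence_mixed(n1, n0):
    """xi(x) = 1/2 [ n1! n0!/(n+1)! + [n0 == 0] ] for the prior w = 1/2 [1 + delta(1 - theta)]."""
    uniform_part = Fraction(factorial(n1) * factorial(n0), factorial(n1 + n0 + 1))
    point_mass_part = 1 if n0 == 0 else 0
    return Fraction(1, 2) * (uniform_part + point_mass_part)

def predict_zero_after_ones(n):
    """xi(0 | 1^n) under the mixed prior, which equals 1/(n+2)^2."""
    return evidence_mixed(n, 1) / evidence_mixed(n, 0)

def posterior_all_black(n):
    """P[H'|1^n] = P[theta = 1] * 1 / xi(1^n), which equals (n+1)/(n+2)."""
    return Fraction(1, 2) / evidence_mixed(n, 0)
```

After only eight black observations the hypothesis $\theta = 1$ already has posterior probability 9/10, while the chance of a non-black object next is 1/100.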
The findings above remain qualitatively valid for i.i.d. processes over finite non-binary
alphabet $|\mathcal{X}| > 2$ and for non-uniform prior.

Surely to get a generally working setup, we should also assign a non-zero prior
to $\theta = 0$ and to all other "special" $\theta$, like $\frac12$ and $\frac16$, which may
naturally appear in a hypothesis, like "is the coin or die fair". The natural continuation
of this thought is to assign non-zero prior to all computable $\theta$. This is another
motivation for the universal prior $w_\theta^U = 2^{-K(\theta)}$ (12) constructed in
Section 3. It is difficult but not impossible to operate with such a prior [PH04, PH06].
One may want to mix the discrete prior $w_\nu^U$ with a continuous (e.g. uniform) prior
density, so that the set of non-computable $\theta$ keeps a non-zero density. Although
possible, we will see that this is actually not necessary.

Reparametrization invariance. Naively, the uniform prior is justified by the
indifference principle, but as discussed in Section 2, uniformity is not
reparametrization invariant. For instance, if in our Bernoulli example we introduce a new
parametrization $\theta'=\sqrt\theta$, then the $\theta'$-density
$w'(\theta')=2\sqrt\theta\,w(\theta)$ is no longer uniform if $w(\theta)=1$ is uniform.

More generally, assume we have some principle which leads to some prior $w(\theta)$.
Now we apply the principle to a different parametrization $\theta'$ and get prior
$w'(\theta')$. Assume that $\theta$ and $\theta'$ are related via a bijection
$\theta=f(\theta')$. Another way to get a $\theta'$-prior is to transform the
$\theta$-prior, $w(\theta)\leadsto\tilde w(\theta')$. The reparametrization invariance
principle (RIP) states that $w'$ should be equal to $\tilde w$.

For discrete $\Theta$, the transformed prior is simply $\tilde w_{\theta'}=w_{f(\theta')}$,
and a uniform prior remains uniform ($w'_{\theta'}=\tilde w_{\theta'}=w_\theta=\frac{1}{|\Theta|}$)
in any parametrization, i.e. the indifference principle satisfies RIP in
finite model classes.

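In case of densities, the transformed prior is
$\tilde w(\theta')=w(f(\theta'))\frac{df(\theta')}{d\theta'}$, and the indifference
principle violates RIP for non-linear transformations $f$. But Jeffreys' and Bernardo's
principles satisfy RIP. For instance, in the Bernoulli case we have
$j_n(\theta)=\frac1\theta+\frac1{1-\theta}$, hence
$w(\theta)=\frac1\pi[\theta(1-\theta)]^{-1/2}$ and
$w'(\theta')=\frac1\pi[f(\theta')(1-f(\theta'))]^{-1/2}\frac{df(\theta')}{d\theta'}=\tilde w(\theta')$.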
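
The contrast between the indifference principle and Jeffreys' principle can be checked numerically. The sketch below assumes the illustrative reparametrization theta = f(theta') = theta'^2 on (0,1), and approximates the Fisher information by finite differences; all names are hypothetical helpers, not from the paper.

```python
import math

def fisher_information(prob_of_one, t, eps=1e-6):
    """Fisher information of one Bernoulli observation at parameter value t,
    using central finite differences for the score d/dt log p(x|t)."""
    info = 0.0
    for p in (prob_of_one, lambda s: 1.0 - prob_of_one(s)):
        score = (math.log(p(t + eps)) - math.log(p(t - eps))) / (2 * eps)
        info += p(t) * score ** 2
    return info

def f(tp):
    """Assumed reparametrization theta = f(theta'), a bijection on (0, 1)."""
    return tp ** 2

def df(tp):
    """Derivative of f."""
    return 2 * tp

def jeffreys_theta(th):
    """Jeffreys' principle applied in the theta-parametrization (normalizer pi)."""
    return math.sqrt(fisher_information(lambda s: s, th)) / math.pi

def jeffreys_theta_prime(tp):
    """Jeffreys' principle applied directly in the theta'-parametrization."""
    return math.sqrt(fisher_information(f, tp)) / math.pi

def transformed(density, tp):
    """A theta-density pushed to the theta'-parametrization via f."""
    return density(f(tp)) * df(tp)

tp = 0.3
print(jeffreys_theta_prime(tp), transformed(jeffreys_theta, tp))   # the two agree
print(1.0, transformed(lambda th: 1.0, tp))                        # uniform: 1 versus 2*tp
```

Applying Jeffreys' rule before or after the change of variables gives the same density, while the uniform density turns into 2*theta' and hence disagrees with a fresh application of indifference in the new parametrization.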

Does the universal prior $w^U_\theta=2^{-K(\theta)}$ satisfy RIP? If we apply the
"universality principle" to a $\theta'$-parametrization, we get
$w'^U_{\theta'}=2^{-K(\theta')}$. On the other hand, $w^U_\theta$ simply transforms to
$\tilde w^U_{\theta'}=w^U_{f(\theta')}=2^{-K(f(\theta'))}$ ($w^U_\theta$ is a discrete
(non-density) prior, which is non-zero on a discrete subset of $\mathcal M$). For
computable $f$ we have $K(f(\theta'))\stackrel{+}{\le}K(\theta')+K(f)$ by (11d), and
similarly $K(f^{-1}(\theta))\stackrel{+}{\le}K(\theta)+K(f)$ if $f$ is invertible. Hence
for simple bijections $f$, i.e. for $K(f)=O(1)$, we have
$K(f(\theta'))\stackrel{+}{=}K(\theta')$, which implies
$w'^U_{\theta'}\stackrel{\times}{=}\tilde w^U_{\theta'}$, i.e. the universal prior satisfies
RIP w.r.t. simple transformations $f$ (within a multiplicative constant).

Regrouping invariance. There are important transformations $f$ which are not
bijections, which we consider in the following. A simple non-bijection is
$\theta=f(\theta')=\theta'^2$ if we consider $\theta'\in[-1,1]$. More interesting is the
following example: Assume we had decided not to record blackness versus non-blackness of
objects, but their "color". For simplicity of exposition assume we record only whether an
object is black or white or colored, i.e. $\mathcal X'=\{B,W,C\}$. In analogy to the binary
case we use the indifference principle to assign a uniform prior on
$\theta'\in\Theta':=\Delta_3$, where
$\Delta_d:=\{\theta'\in[0,1]^d:\sum_{i=1}^d\theta'_i=1\}$, and
$\nu_{\theta'}(x'_{1:n})=\prod_i\theta_i'^{\,n_i}$. All inferences regarding
blackness (predictive and posterior) are identical to the binomial model
$\nu_\theta(x_{1:n})=\theta^{n_1}(1-\theta)^{n_0}$ with $x'_t=B\leadsto x_t=1$ and
$x'_t=W$ or $C\leadsto x_t=0$ and $\theta=f(\theta')=\theta'_B$ and
$w(\theta)=\int_{\Delta_3}w'(\theta')\delta(\theta'_B-\theta)\,d\theta'$. Unfortunately, for
uniform prior $w'(\theta')\propto 1$, $w(\theta)\propto 1-\theta$ is not uniform, i.e. the
indifference principle is not invariant under splitting/grouping, or general regrouping.
Regrouping invariance is regarded as a very important and desirable property [Wal96].

I now consider general i.i.d. processes $\nu_\theta(x)=\prod_{i=1}^d\theta_i^{n_i}$.
Dirichlet priors $w(\theta)\propto\prod_{i=1}^d\theta_i^{\alpha_i-1}$ form a natural
conjugate class ($w(\theta|x)\propto\prod_{i=1}^d\theta_i^{n_i+\alpha_i-1}$) and are the
default priors for multinomial (i.i.d.) processes over finite alphabet $\mathcal X$ of
size $d$. Note that $\xi(a|x)=\frac{n_a+\alpha_a}{n+\alpha_1+...+\alpha_d}$ generalizes
Laplace's rule and coincides with Carnap's [Car52] confirmation function. Symmetry demands
$\alpha_1=...=\alpha_d$; for instance $\alpha\equiv 1$ for uniform and
$\alpha\equiv\frac12$ for Bernardo-Jeffreys' prior. Grouping two "colors" $i$ and $j$
results in a Dirichlet prior with $\alpha_{i\&j}=\alpha_i+\alpha_j$ for the group. The only
way to respect symmetry under all possible groupings is to set $\alpha\equiv 0$. This is
Haldane's improper prior, which results in unacceptably overconfident predictions
$\xi(1|1^n)=1$. Walley [Wal96] solves the problem that there is no single acceptable prior
density by considering sets of priors.
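
A small exact computation illustrates the regrouping problem for Dirichlet priors. It is a sketch under the assumption of a toy data set of five objects; the symbol names and counts are made up for illustration.

```python
from fractions import Fraction

def dirichlet_predict(counts, alphas, symbol):
    """Predictive probability (n_a + alpha_a) / (n + sum of all alphas)."""
    n = sum(counts.values())
    return Fraction(counts[symbol] + alphas[symbol]) / (n + sum(alphas.values()))

counts3 = {"B": 4, "W": 1, "C": 0}      # ternary record of five objects (toy data)
counts2 = {"B": 4, "notB": 1}           # the same data with W and C grouped

for a in (Fraction(1), Fraction(1, 2)):
    ternary = dirichlet_predict(counts3, dict.fromkeys(counts3, a), "B")
    symmetric = dirichlet_predict(counts2, dict.fromkeys(counts2, a), "B")
    grouped = dirichlet_predict(counts2, {"B": a, "notB": 2 * a}, "B")
    print(a, ternary, symmetric, grouped)

# Haldane's choice alpha = 0 is grouping-consistent but overconfident after 1^n.
haldane = dirichlet_predict({"B": 5, "notB": 0}, {"B": 0, "notB": 0}, "B")
print(haldane)
```

The ternary symmetric prior agrees with the grouped binary prior whose parameters were added, but not with the symmetric binary prior, so symmetry and grouping cannot both be respected for positive alpha.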

I now show that the universal prior $w'^U_{\theta'}=2^{-K(\theta')}$ is invariant under
regrouping, and more generally under all simple (computable with complexity $O(1)$), even
non-bijective, transformations. Consider prior $w'_{\theta'}$. If $\theta=f(\theta')$ then
$w'_{\theta'}$ transforms to $\tilde w_\theta=\sum_{\theta':f(\theta')=\theta}w'_{\theta'}$
(note that for non-bijections there is more than one $w'_{\theta'}$ consistent with
$\tilde w_\theta$). In $\theta'$-parametrization, the universal prior reads
$w'^U_{\theta'}=2^{-K(\theta')}$. Using (11f) with $x=\langle\theta'\rangle$ and
$y=\langle\theta\rangle$ we get

    $\tilde w^U_\theta=\sum_{\theta':f(\theta')=\theta}2^{-K(\theta')}\stackrel{\times}{=}2^{-K(\theta)}=w^U_\theta$

i.e. the universal prior is general transformation and hence regrouping invariant
(within a multiplicative constant) w.r.t. simple computable transformations $f$.

Note that reparametrization and regrouping invariance hold for arbitrary classes
$\mathcal M$ and are not limited to the i.i.d. case.

Instantaneous bounds. The cumulative bounds (5) and (10) stay valid for i.i.d.
processes, but instantaneous bounds are now also possible. For i.i.d. $\mathcal M$ with
continuous, discrete, and universal prior, respectively, one can show (in preparation;
see [Kri98, PH04, PH06] for related bounds)

    $E[h_n]\stackrel{\times}{\le}\frac1n\ln w(\theta_0)^{-1}$ and $E[h_n]\stackrel{\times}{\le}\frac1n\ln w_{\theta_0}^{-1}=\frac1n K(\theta_0)\ln 2$    (14)

Note that, if summed up over $n$, they lead to weaker cumulative bounds.
5 Universal Sequence Prediction



Section 3 derived the universal prior and Section 4 discussed i.i.d. classes. What
remains and will be done in this section is to find a universal class of environments,
namely Solomonoff-Levin's class of all (lower semi)computable (semi)measures. The
resulting universal mixture is equivalent to the output distribution of a universal
Turing machine with uniform input distribution. The universal prior avoids the
problem of old evidence and the universal class avoids the necessity of updating
$\mathcal M$. I discuss the general total bounds of Section 2 for the specific universal
mixture, and supplement them with some weak instantaneous bounds. Finally, I show that
the universal mixture performs better than classical continuous mixtures, even in
uncomputable environments.

Universal choice of $\mathcal M$. The bounds of Section 2 apply if $\mathcal M$ contains
the true environment $\mu$. The larger $\mathcal M$, the less restrictive is this
assumption. The class of all computable distributions, although only countable, is pretty
large from a practical point of view. (Finding a non-computable physical system would
overturn the Church-Turing thesis.) It is the largest class relevant from a computational
point of view. Solomonoff [Sol64, Eq.(13)] defined and studied the mixture over this
class.

One problem is that this class is not enumerable, since the class of computable
functions $f:\mathcal X^*\to\mathbb R$ is not enumerable (halting problem), nor is it
decidable whether a function is a measure. Hence $\xi$ is completely incomputable. Levin
[ZL70] had the idea to "slightly" extend the class and include also lower semi-computable
semimeasures. One can show that this class $\mathcal M_U=\{\nu_1,\nu_2,...\}$ is
enumerable, hence

    $\xi_U(x)=\sum_{\nu\in\mathcal M_U}w^U_\nu\,\nu(x)$    (15)

is itself lower semi-computable, i.e. $\xi_U\in\mathcal M_U$, which is a convenient
property in itself. Note that since
$\frac{1}{n\log^2 n}\stackrel{\times}{\le}w^U_{\nu_n}\le\frac1n$ for most $n$ by (11b) and
(11c), most $\nu_n$ have prior approximately reciprocal to their index $n$, as advocated by
Jeffreys [Jef61, p238] and Rissanen [Ris83].

In some sense $\mathcal M_U$ is the largest class of environments for which $\xi$ is in
some sense computable [Hut03b, Hut06], but see [Sch02a] for even larger classes. Note
that including non-semi-computable $\nu$ would not affect $\xi_U$, since $w^U_\nu=0$ on
such environments.

The problem of old evidence. An important problem in Bayesian inference
in general and (Bayesian) confirmation theory [Ear93] in particular is how to deal
with 'old evidence' or equivalently with 'new theories'. How shall a Bayesian treat
the case when some evidence $E\,\hat=\,x$ (e.g. Mercury's perihelion advance) is known
well before the correct hypothesis/theory/model $H\,\hat=\,\mu$ (Einstein's general
relativity theory) is found? How shall $H$ be added to the Bayesian machinery a posteriori?
What is the prior of $H$? Should it be the belief in $H$ in a hypothetical counterfactual
world in which $E$ is not known? Can old evidence $E$ confirm $H$? After all, $H$ could
simply be constructed/biased/fitted towards "explaining" $E$.

The universal class $\mathcal M_U$ and universal prior $w^U_\nu$ formally solve this
problem: The universal prior of $H$ is $2^{-K(H)}$. This is independent of $\mathcal M$ and
of whether $E$ is known or not. If we use $E$ to construct $H$ or fit $H$ to explain $E$,
this will lead to a theory which is more complex ($K(H)\stackrel{+}{\ge}K(E)$) than a
theory from scratch ($K(H)=O(1)$), so cheats are automatically penalized. There is no
problem of adding hypotheses to $\mathcal M$ a posteriori. Priors of old hypotheses are not
affected. Finally, $\mathcal M_U$ includes all hypotheses (including yet unknown or
unnamed ones) a priori. So at least theoretically, updating $\mathcal M$ is unnecessary.

Other representations of $\xi_U$. Definition (15) is somewhat complex, relying on
enumeration of semimeasures and Kolmogorov complexity. I now approach $\xi_U$ from
a different perspective. Assume that our world is governed by a computable
deterministic process describable in $\le l$ bits. Consider a standard (not prefix) Turing
machine $U'$ and programs $p$ generating environments starting with $x$. Let us pad
all programs so that they have length exactly $l$. Among the $2^l$ programs of length
$l$ there are $N_l(x):=\#\{p\in\{0,1\}^l:U'(p)=x*\}$ programs consistent with observation
$x$. If we regard all environmental descriptions $p\in\{0,1\}^l$ a priori as equally likely
(Epicurus) we should adopt the relative frequency $N_l(x)/2^l$ as our prior belief in $x$.
Since we do not know $l$ and we can pad every $p$ arbitrarily, we could take the limit
$M(x):=\lim_{l\to\infty}N_l(x)/2^l$ (which exists, since $N_l(x)/2^l$ increases). Or
equivalently: $M(x)$ is the probability that $U'$ outputs a string starting with $x$ when
provided with uniform random noise on the program tape. Note that a uniform distribution
is also used in the No Free Lunch theorems [WM97] to prove the impossibility of universal
learners, but in our case the uniform distribution is piped through a universal Turing
machine, which defeats these negative implications. Yet another representation of $M$
is as follows: For every $q$ printing $x*$ there exists a shortest prefix (called minimal)
$p$ of $q$ printing $x$. $p$ possesses $2^{l-\ell(p)}$ prolongations to length $l$, all
printing $x*$. Hence all prolongations of $p$ together yield a contribution
$2^{l-\ell(p)}/2^l=2^{-\ell(p)}$ to $M(x)$. Let $U(p)=x*$ iff $p$ is a minimal program
printing a string starting with $x$. Then

    $M(x):=\sum_{p:U(p)=x*}2^{-\ell(p)}$    (16)

which may be regarded as a $2^{-\ell(p)}$-weighted mixture over all computable
deterministic environments $\nu_p$ ($\nu_p(x)=1$ if $U(p)=x*$ and $0$ else). Now, as a
positive surprise, $M(x)$ coincides with $\xi_U(x)$ within an irrelevant multiplicative
constant. So it is actually sufficient to consider the class of deterministic
semimeasures. The reason is that the probabilistic semimeasures are in the convex hull of
the deterministic ones, and so need not be taken extra into account in the mixture. One
can also get an explicit enumeration of all lower semi-computable semimeasures
$\mathcal M_U=\{\nu_1,\nu_2,...\}$ by means of
$\nu_i(x):=\sum_{p:T_i(p)=x*}2^{-\ell(p)}$, where $T_i(p)\equiv U(\langle i\rangle p)$,
$i=1,2,...$ is an enumeration of all monotone Turing machines.
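
The sum over minimal programs cannot be computed for a universal machine, but its structure can be illustrated on a deliberately simple, non-universal toy machine. In the sketch below the machine, its program format (k ones, a zero, then k pattern bits that are printed periodically forever), and the truncation K_MAX are all assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product

K_MAX = 10  # longest pattern considered (truncation, an assumption of this sketch)

def toy_output(pattern, length):
    """Toy machine T: the program 1^k 0 <k pattern bits> prints the pattern forever.
    Returns the first `length` output symbols."""
    reps = length // len(pattern) + 1
    return (pattern * reps)[:length]

def M_toy(x):
    """Sum of 2^(-len(p)) over all toy programs p whose output starts with x."""
    total = Fraction(0)
    for k in range(1, K_MAX + 1):
        weight = Fraction(1, 2 ** (2 * k + 1))      # len(p) = k + 1 + k bits
        for bits in product("01", repeat=k):
            if toy_output("".join(bits), len(x)) == x:
                total += weight
    return total

def predict(x, a):
    """Conditional M_toy(a | x) = M_toy(xa) / M_toy(x)."""
    return M_toy(x + a) / M_toy(x)

print(float(predict("1" * 8, "1")))      # close to 1: short all-ones programs dominate
print(float(predict("0101010", "1")))    # the period-2 continuation is favoured
```

Because short programs carry exponentially more weight, a few observations of a simple pattern already make the toy predictor confident in its continuation, which mirrors the qualitative behaviour claimed for M.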
Bounds for computable environments. The bound (13) surely is applicable for
$\xi=\xi_U$ and now holds for any computable measure $\mu$. Within an additive constant
the bound is also valid for $M\stackrel{\times}{=}\xi_U$. That is, $\xi_U$ and $M$ are
excellent predictors with the only condition that the sequence is drawn from any
computable probability distribution. Bound (13) shows that the total number of prediction
errors is small. Similarly, one can show that
$\sum_{t=1}^n|1-M(x_t|x_{<t})|\le Km(x_{1:n})\ln 2$, where the monotone complexity
$Km(x):=\min\{\ell(p):U(p)=x*\}$ is defined as the length of the shortest (nonhalting)
program computing a string starting with $x$ [ZL70, LV97, Hut04].

If $x_{1:\infty}$ is a computable sequence, then $Km(x_{1:\infty})$ is finite, which
implies $M(x_t|x_{<t})\to 1$ on every computable sequence. This means that if the
environment is a computable sequence (whichsoever, e.g. $1^\infty$ or the digits of $\pi$
or $e$), after having seen the first few digits, $M$ correctly predicts the next digit with
high probability, i.e. it recognizes the structure of the sequence. In particular,
observing an increasing number of black balls or black ravens or sunrises,
$M(1|1^n)\to 1$ ($Km(1^\infty)=O(1)$) becomes rapidly confident that future balls and
ravens are black and that the sun will rise tomorrow.

Total bounds (3) and (13) are suitable in an online setting, but given a fixed
number of $n$ observations, they give no guarantee on the next instance.

Weak instantaneous bounds. In Section 4 I derived good instantaneous bounds
for i.i.d. classes. For coin or die flips or balls drawn from an urn this model is
appropriate. But ornithologists do not really sample ravens independently at random.
Although not strictly valid, the i.i.d. model may in this case still serve as a useful
proxy for the true process. But to model the rise of the sun as an i.i.d. process is
more than questionable. On the other hand it is plausible that these examples (and
other processes like weather or stock-market) are governed by some (probabilistic)
computable process. So model class $\mathcal M_U$ and predictor $M$ seem appropriate.
While excellent total bounds (3) and (13) exist, essentially the only instantaneous bound
I was able to derive (proof in Appendix A) is

    $2^{-K(n)}\stackrel{\times}{\le}M(\bar x_n|x_{<n})\stackrel{\times}{\le}2^{2Km(x_{1:n})-K(n)}$    (17)

valid for all $n$ and $x_{1:n}$ and $\bar x_n\ne x_n$. I discuss the bound for the sequence
$x_{1:\infty}=1^\infty$, but most of what I say remains valid for any other computable
sequence. Since $Km(1^n)=O(1)$, we get

    $M(0|1^n)\stackrel{\times}{=}2^{-K(n)}$

Since $2^{-K(n)}\le\frac1n$ for most $n$, this shows that $M$ quickly disbelieves in
non-black objects and doomsday, similarly as in the i.i.d. model, but now only for most $n$.

Magic numbers. This 'most' qualification has interesting consequences: $M(0|1^n)$
spikes up for simple $n$. So $M$ is cautious at magic instance numbers, e.g. fears
doom on day $2^{20}$ more than on a comparable random day. While this looks odd
and pours water on the mills of prophets, it is not completely absurd. For instance,
major software problems have been anticipated for the magic date, 1st of January
2000. There are many other occasions where something happens at "magic" dates
or instances, for instance solar eclipses.

Also, certain processes in nature follow fast growing sequences like those of the 
powers of two (e.g. the number of cells in an early human embryo) or the Fibonacci 
numbers (e.g. the number of petals or the arrangement of seeds in some flowers). 
Finally, that numbers with low (Kolmogorov) complexity cause high probability in 
real data bases can readily be verified by counting their frequency in the world wide 
web with Google [CV06]. 

On the other hand, (returning to sequence prediction) on most simple dates, 
nothing exceptional happens. Due to the total bound $\sum_{n=0}^\infty M(0|1^n)\stackrel{\times}{\le}O(1)$, $M$
cannot spike up too much too often. M tells us to be more prepared but not to 
expect the unexpected on those days. Another issue is that often we do not know the 
exact start of the sequence. How many ravens exactly have ornithologists observed, 
and how many days exactly did the sun rise so far? In absence of this knowledge 
we need to Bayes-average over the sequence length which will wash out the spikes. 

Universal is better than continuous $\mathcal M$. Although I argued that incomputable
environments $\mu$ can safely be ignored, one may be nevertheless uneasy using
Solomonoff's $M\stackrel{\times}{=}\xi_U$ (16) if outperformed by a continuous mixture
$\xi$ (9) on such $\mu\in\mathcal M\setminus\mathcal M_U$, for instance if $M$ would fail
to predict a Bernoulli($\theta$) sequence for incomputable $\theta$. Luckily this is not
the case: Although $\nu_\theta()$ and $w_\theta$ can be incomputable, the studied classes
$\mathcal M$ themselves, i.e. the two-argument function $\nu_{()}()$, and the weight
function $w_{()}$, and hence $\xi()$, are typically computable (the integral can be
approximated to arbitrary precision). Hence
$M(x)\stackrel{\times}{=}\xi_U(x)\ge 2^{-K(\xi)}\xi(x)$ by (15), and $K(\xi)$ is often
quite small. This implies for all $\mu$

    $D_n(\mu\|M)\equiv E\big[\ln\frac{\mu(\omega_{1:n})}{M(\omega_{1:n})}\big]=E\big[\ln\frac{\mu(\omega_{1:n})}{\xi(\omega_{1:n})}\big]+E\big[\ln\frac{\xi(\omega_{1:n})}{M(\omega_{1:n})}\big]\stackrel{+}{\le}D_n(\mu\|\xi)+K(\xi)\ln 2$

So any bound (10) for $D_n(\mu\|\xi)$ is directly valid also for $D_n(\mu\|M)$, save an
additive constant. That is, $M$ is superior (or equal) to all computable mixture predictors
$\xi$ based on any (continuous or discrete) model class $\mathcal M$ and weight
$w(\theta)$, even if environment $\mu$ is not computable. Furthermore, while for
essentially all parametric classes, $D_n(\mu\|\xi)\sim\frac d2\ln n$ grows logarithmically
in $n$ for all (incl. computable) $\mu\in\mathcal M$,
$D_n(\mu\|M)\stackrel{+}{\le}K(\mu)\ln 2$ is finite for computable $\mu$. Bernardo's prior
even implies a bound for $M$ that is uniform (minimax) in $\theta\in\Theta$. Many other
priors based on reasonable principles are argued for (see Section 3 and [KW96]). The above
shows that $M$ is superior to all of them. Actually the existence of any computable
probabilistic predictor $\rho$ with $D_n(\mu\|\rho)=o(n)$ is sufficient for $M$ to predict
$\mu$ equally well.
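
The dominance step behind this argument (a mixture is never worse than any of its components by more than the log of the inverse weight) can be illustrated with a finite stand-in. In the sketch below the two component predictors and their weights of 1/2 are assumptions chosen for illustration; they play the role that $\xi$ and $2^{-K(\xi)}$ play for $M$.

```python
import math

def laplace(x):
    """Uniform-prior (Laplace) mixture probability of a binary string x."""
    n1, n0 = x.count("1"), x.count("0")
    return math.factorial(n1) * math.factorial(n0) / math.factorial(len(x) + 1)

def all_ones(x):
    """Deterministic environment that emits 1 forever."""
    return 1.0 if "0" not in x else 0.0

# Assumed weights; in the text the weight of a component xi would be 2^(-K(xi)).
predictors = {"laplace": (0.5, laplace), "all_ones": (0.5, all_ones)}

def mixture(x):
    """Weighted mixture of the component predictors."""
    return sum(w * rho(x) for w, rho in predictors.values())

def log_loss(rho, x):
    """Cumulative log-loss -ln rho(x); infinite if rho assigns probability zero."""
    p = rho(x)
    return -math.log(p) if p > 0 else math.inf

x = "1" * 50
for name, (w, rho) in predictors.items():
    extra = log_loss(mixture, x) - log_loss(rho, x)
    print(name, extra, "<=", math.log(1 / w))
```

The excess loss of the mixture over each component stays below ln(1/w) on every sequence, which is the finite analogue of the additive constant $K(\xi)\ln 2$ above.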

Future bounds. Another important question is how many errors are still to come
after some grace or learning period. Formally, given $x_{1:n}$, how large is the future
expected error $r_n:=\sum_{t=n+1}^\infty E[h_t|\omega_{1:n}=x_{1:n}]$? The total bound
(5)+(13) only implies that $r_n$ asymptotically tends to zero w.p.1, and the instantaneous
bounds (14) and (17) are weak and do not sum up finitely. Since the complexity of $\mu$
bounds the total loss, a natural guess is that something like the conditional complexity of
$\mu$ given $x$ (on an extra input tape) bounds the future loss. Indeed one can show
[Hut04, CH05]

    $\sum_{t=n+1}^\infty E[h_t|\omega_{1:n}]\stackrel{+}{\le}[K(\mu|\omega_{1:n})+K(n)]\ln 2$    (18)

i.e. if our past observations $\omega_{1:n}$ contain a lot of information about $\mu$, we
make few errors in future. For instance, consider the large space $\mathcal X$ of pixel
images, and all observations are identical, $\mu\,\hat=\,\omega=x_1x_1x_1...$, where $x_1$
is a "typical" image of complexity, say, $K(x_1)=10^6\stackrel{+}{=}Km(\omega)$. Obviously,
after seeing a couple of identical images we expect the next one to be the same again.
While total bound (13) quite uselessly tells us that $M$ makes less than $10^6$ errors,
future bound (18) with $n=1$ shows that $M$ makes only $K(\mu|x_1)=O(1)$ errors. The $K(n)$
term can be improved to the complexity of the randomness deficiency of $\omega_{1:n}$ if a
more suitable variant of algorithmic complexity that is monotone in its condition is used
[CH05, CHS07]. No future bounds analogous to (18) for general prior or class are known.

6 Discussion 

Critique and problems. In practice we often have extra information about the
problem at hand, which could and should be used to guide the forecasting. One
way is to explicate all our prior knowledge $y$ and place it on an extra input tape of
our universal Turing machine $U$, which leads to the conditional complexity $K(\cdot|y)$.
We now assign "subjective" prior $w^U_{\nu|y}=2^{-K(\nu|y)}$ to environment $\nu$, which is
large for those $\nu$ that are simple (have short description) relative to our background
knowledge $y$. Since $K(\mu|y)\stackrel{+}{\le}K(\mu)$, extra knowledge never misguides
(see (13)). Alternatively we could prefix our observation sequence $x$ by $y$ and use
$M(yx)$ for prediction [Hut04].

Another critique concerns the dependence of $K$ and $M$ on $U$. Predictions for
short sequences $x$ (shorter than typical compiler lengths) can be arbitrary. But
taking into account our (whole) scientific prior knowledge $y$, and predicting the now
long string $yx$ leads to good (less sensitive to "reasonable" $U$) predictions [Hut04].
For an interesting attempt to make $M$ unique see [Mül06].

Finally, $K$ and $M$ can serve as "gold standards" which practitioners should aim
at, but since they are only semi-computable, they have to be (crudely) approximated
in practice. Levin complexity [LV97], the speed prior [Sch02b], the minimal message
and description length principles [Ris89, Wal05], and off-the-shelf compressors like
Lempel-Ziv [LZ76] are such approximations, which have been successfully applied
to a plethora of problems [CV05, Sch04].
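
As a rough illustration of how crude such approximations are, the compressed length produced by an off-the-shelf compressor can be used as a computable upper-bound proxy for $K$. The sketch below uses Python's standard zlib module (a Lempel-Ziv-style compressor); the test strings and the helper name are assumptions made for illustration.

```python
import random
import zlib

def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data: a crude upper-bound proxy for K."""
    return 8 * len(zlib.compress(data, 9))

rng = random.Random(0)                       # fixed seed so the run is repeatable
regular = b"01" * 500                        # highly regular 1000-byte string
noisy = bytes(rng.getrandbits(8) for _ in range(1000))   # pseudo-random bytes

print(compressed_bits(regular), compressed_bits(noisy))
```

The regular string compresses to a small fraction of its length, while the pseudo-random string does not compress at all, even though it was generated by a short program and therefore has low Kolmogorov complexity; the compressor simply cannot find that regularity.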

Summary. I compared traditional Bayesian sequence prediction based on continuous
classes and prior densities to Solomonoff's universal predictor $M$, prior
$w^U_\nu$, and class $\mathcal M_U$. I discussed the following advantages (+) and
problems (-) of Solomonoff's approach:

+ general total bounds for generic class, prior, and loss,

+ universal and i.i.d.-specific instantaneous and future bounds,

+ the $D_n$ bound for continuous classes,

+ indifference/symmetry principles,

+ the problem of zero p(oste)rior and confirmation of universal hypotheses,

+ reparametrization and regrouping invariance,

+ the problem of old evidence and updating,

+ that $M$ works even in non-computable environments,

+ how to incorporate prior knowledge,

- the prediction of short sequences,

- the constant fudges in all results and the $U$-dependence,

- $M$'s incomputability and crude practical approximations.

In short, universal prediction solves or avoids or meliorates many foundational and
philosophical problems, but has to be compromised in practice.

Conclusion. The goal of the paper was to establish a single, universal theory 
for (sequence) prediction and (hypothesis) confirmation, applicable to all inductive 
inference problems. I started by showing that Bayesian prediction is consistent for 
any countable model class, provided it contains the true distribution. The major 
(agonizing) problem Bayesian statistics leaves open is how to choose the model 
class and the prior. Solomonoff's theory fills this gap by choosing the class of 
all computable (stochastic) models, and a universal prior inspired by Ockham and 
Epicurus, and quantified by Kolmogorov complexity. I discussed in breadth how 
and in which sense this theory solves the inductive inference problem, by studying 
a plethora of problems other approaches suffer from. In one line: All you need 
for universal prediction is Ockham, Epicurus, Bayes, Solomonoff, Kolmogorov, and 
Turing. By including Bellman, one can extend this theory to universal decisions in
reactive environments [Hut04].

Acknowledgements. I would like to thank Frank Stephan for his detailed feedback 
on earlier drafts. 



References 

[Bay63] T. Bayes. An essay towards solving a problem in the doctrine of chances. Philo- 
sophical Transactions of the Royal Society, 53:376-398, 1763. [Reprinted in 
Biometrika, 45, 296-315, 1958]. 

[Ber79] J. M. Bernardo. Reference posterior distributions for Bayesian inference (with 
discussion). Journal of the Royal Statistical Society, B41:113-147, 1979. 

[Car52] R. Carnap. The Continuum of Inductive Methods. University of Chicago Press, 
Chicago, 1952. 

[CB90] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes 
methods. IEEE Transactions on Information Theory, 36:453-471, 1990. 






[CH05] A. Chernov and M. Hutter. Monotone conditional complexity bounds on fu- 
ture prediction errors. In Proc. 16th International Conf. on Algorithmic Learn- 
ing Theory (ALT'05), volume 3734 of LNAI, pages 414-428, Singapore, 2005. 
Springer, Berlin. 

[Cha75] G. J. Chaitin. A theory of program size formally identical to information theory. 
Journal of the ACM, 22(3):329-340, 1975. 

[CHS07] A. Chernov, M. Hutter, and J. Schmidhuber. Algorithmic complexity bounds on 
future prediction errors. Information and Computation, 205(2):242-261, 2007. 

[CV05] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Trans.
Information Theory, 51(4):1523-1545, 2005.

[CV06] R. Cilibrasi and P. M. B. Vitányi. Similarity of objects and the meaning of
words. In Proc. 3rd Annual Conference on Theory and Applications of Models of
Computation (TAMC'06), volume 3959 of LNCS, pages 21-45. Springer, 2006.

[Daw84] A. P. Dawid. Statistical theory. The prequential approach. Journal of the Royal 
Statistical Society, Series A 147:278-292, 1984. 

[Ear93] J. Earman. Bayes or Bust? A Critical Examination of Bayesian Confirmation 
Theory. MIT Press, Cambridge, MA, 1993. 

[Gac74] P. Gács. On the symmetry of algorithmic information. Soviet Mathematics
Doklady, 15:1477-1480, 1974.

[Gac83] P. Gács. On the relation between descriptional complexity and algorithmic
probability. Theoretical Computer Science, 22:71-93, 1983.

[HM04] M. Hutter and An. A. Muchnik. Universal convergence of semimeasures on 
individual random sequences. In Proc. 15th International Conf. on Algorithmic 
Learning Theory (ALT'04), volume 3244 of LNAI, pages 234-248, Padova, 2004. 
Springer, Berlin. 

[Hut01] M. Hutter. New error bounds for Solomonoff prediction. Journal of Computer
and System Sciences, 62(4):653-667, 2001.

[Hut03a] M. Hutter. Convergence and loss bounds for Bayesian sequence prediction. IEEE 
Transactions on Information Theory, 49(8):2061-2067, 2003. 

[Hut03b] M. Hutter. On the existence and convergence of computable universal priors.
In Proc. 14th International Conf. on Algorithmic Learning Theory (ALT'03),
volume 2842 of LNAI, pages 298-312, Sapporo, 2003. Springer, Berlin.

[Hut03c] M. Hutter. Optimality of universal Bayesian prediction for general loss and 
alphabet. Journal of Machine Learning Research, 4:971-1000, 2003. 

[Hut04] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based
on Algorithmic Probability. Springer, Berlin, 2004. 300 pages,
http://www.hutter1.net/ai/uaibook.htm.






[Hut06] M. Hutter. On generalized computable universal priors and their convergence. 
Theoretical Computer Science, 364:27-41, 2006. 

[Jay03] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University 
Press, Cambridge, MA, 2003. 

[Jef46] H. Jeffreys. An invariant form for the prior probability in estimation problems. 
In Proc. Royal Society London, volume Series A 186, pages 453-461, 1946. 

[Jef61] H. Jeffreys. Theory of Probability. Clarendon Press, Oxford, 3rd edition, 1961. 

[Kol65] A. N. Kolmogorov. Three approaches to the quantitative definition of
information. Problems of Information and Transmission, 1(1):1-7, 1965.

[Kri98] R. E. Krichevskiy. Laplace's law of succession and universal encoding. IEEE
Transactions on Information Theory, 44:296-303, 1998.

[KW96] R. E. Kass and L. Wasserman. The selection of prior distributions by formal
rules. Journal of the American Statistical Association, 91(435):1343-1370, 1996.

[Lap12] P. Laplace. Théorie analytique des probabilités. Courcier, Paris, 1812. [English
translation by F. W. Truscott and F. L. Emory: A Philosophical Essay on
Probabilities. Dover, 1952].

[Lev74] L. A. Levin. Laws of information conservation (non-growth) and aspects of
the foundation of probability theory. Problems of Information Transmission,
10(3):206-210, 1974.

[LV97] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its
Applications. Springer, New York, 2nd edition, 1997.

[LZ76] A. Lempel and J. Ziv. On the complexity of finite sequences. IEEE Transactions 
on Information Theory, 22:75-81, 1976. 

[Mah04] P. Maher. Probability captures the logic of scientific confirmation. In
C. Hitchcock, editor, Contemporary Debates in Philosophy of Science, chapter 3, pages
69-93. Blackwell Publishing, 2004.

[MF98] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Infor- 
mation Theory, 44(6):2124-2147, 1998. 

[Mül06] Markus Müller. Stationary algorithmic probability. Technical Report
http://arXiv.org/abs/cs/0608095, TU Berlin, Berlin, 2006.

[PH04] J. Poland and M. Hutter. On the convergence speed of MDL predictions for 
Bernoulli sequences. In Proc. 15th International Conf. on Algorithmic Learning 
Theory (ALT'04), volume 3244 of LNAI, pages 294-308, Padova, 2004. Springer, 
Berlin. 

[PH06] J. Poland and M. Hutter. MDL convergence speed for Bernoulli sequences.
Statistics and Computing, 16(2):161-175, 2006.






[Ris83] J. J. Rissanen. A universal prior for integers and estimation by minimum de- 
scription length. Annals of Statistics, 11(2):416-431, 1983. 

[Ris89] J. J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 
Singapore, 1989. 

[Sch02a] J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and 
nonenumerable universal measures computable in the limit. International Jour- 
nal of Foundations of Computer Science, 13(4):587-612, 2002. 

[Sch02b] J. Schmidhuber. The speed prior: A new simplicity measure yielding near- 
optimal computable predictions. In Proc. 15th Conf. on Computational Learn- 
ing Theory (COLT-2002), volume 2375 of LNAI, pages 216-228, Sydney, 2002. 
Springer, Berlin. 

[Sch04] J. Schmidhuber. Optimal ordered problem solver. Machine Learning, 54(3):211- 
254, 2004. 

[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Infor- 
mation and Control, 7:1-22 and 224-254, 1964. 

[Sol78] R. J. Solomonoff. Complexity-based induction systems: Comparisons and con- 
vergence theorems. IEEE Transactions on Information Theory, IT-24:422-432, 
1978. 

[Vap99] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, Berlin, 2nd 
edition, 1999. 

[Wal96] P. Walley. Inferences from multinomial data: learning about a bag of marbles.
Journal of the Royal Statistical Society B, 58(1):3-57, 1996.

[Wal05] C. S. Wallace. Statistical and Inductive Inference by Minimum Message Length. 
Springer, Berlin, 2005. 

[WM97] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization.
IEEE Transactions on Evolutionary Computation, 1(1):67-82, 1997.

[ZL70] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the devel- 
opment of the concepts of information and randomness by means of the theory 
of algorithms. Russian Mathematical Surveys, 25(6):83-124, 1970. 



A Proofs of (8), (11f), and (17)



Proof of loss bound (8). Let $X$ and $Y$ be real-valued random variables. Taking
the square root of the well-known Schwarz inequality $(E[XY])^2\le E[X^2]E[Y^2]$ we
get

    $E[(X-Y)^2]-\big(\sqrt{E[X^2]}-\sqrt{E[Y^2]}\big)^2=2\sqrt{E[X^2]E[Y^2]}-2E[XY]\ge 0.$
Substituting $X\to\sqrt{a_i}$, $Y\to\sqrt{b_i}$, $E[...]\to\frac1y\sum_i y_i[...]$ with
$y:=\sum_i y_i$ we get, after multiplying with $y$, the "Hellinger" bound

    $\big(\sqrt{\sum_i y_ia_i}-\sqrt{\sum_i y_ib_i}\big)^2\le\sum_i y_i\big(\sqrt{a_i}-\sqrt{b_i}\big)^2$    (19)

for real $a_i,b_i,y_i\ge 0$ (also valid for $y=0$). I will use (19) three times in proving (8).
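
The "Hellinger" bound (19), i.e. (sqrt(sum_i y_i a_i) - sqrt(sum_i y_i b_i))^2 <= sum_i y_i (sqrt(a_i) - sqrt(b_i))^2 for non-negative reals, can be sanity-checked on random inputs. The following sketch is only a numerical spot check under randomly drawn test vectors, not a proof.

```python
import math
import random

def hellinger_gap(y, a, b):
    """Right-hand side minus left-hand side of the bound (19)."""
    lhs = (math.sqrt(sum(yi * ai for yi, ai in zip(y, a)))
           - math.sqrt(sum(yi * bi for yi, bi in zip(y, b)))) ** 2
    rhs = sum(yi * (math.sqrt(ai) - math.sqrt(bi)) ** 2 for yi, ai, bi in zip(y, a, b))
    return rhs - lhs

rng = random.Random(1)                      # fixed seed so the run is repeatable
gaps = []
for _ in range(1000):
    y, a, b = ([rng.random() for _ in range(5)] for _ in range(3))
    gaps.append(hellinger_gap(y, a, b))
print(min(gaps))    # never negative beyond floating-point rounding
```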
With the abbreviations $m=y_t^{\Lambda_\mu}$ and $s=y_t^{\Lambda_\xi}$ and

    $\mathcal X=\{1,...,N\}$, $N=|\mathcal X|$, $i=x_t$, $y_i=\mu(x_t|\omega_{<t})$, $z_i=\xi(x_t|\omega_{<t})$

the loss (7) and Hellinger distance (2) can then be expressed by
$l_t^{\Lambda_\xi}=\sum_i y_i\ell_{is}$, $l_t^{\Lambda_\mu}=\sum_i y_i\ell_{im}$ and
$h_t=\sum_i(\sqrt{y_i}-\sqrt{z_i})^2$. By definition (6) of $y_t^{\Lambda_\mu}$ and
$y_t^{\Lambda_\xi}$ we have

    $\sum_i y_i\ell_{im}\le\sum_i y_i\ell_{ij}$ and $\sum_i z_i\ell_{is}\le\sum_i z_i\ell_{ij}$ for all $j$.    (20)

Actually, I need the first constraint only for $j=s$ and the second for $j=m$. From
(20) we get

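    $\sqrt{\sum_i y_i\ell_{is}}-\sqrt{\sum_i y_i\ell_{im}}\ge 0$    (21)

That is, if we decrease $\ell_{is}\to\ell'_{is}:=\ell_{is}-\delta_i$ and
$\ell_{im}\to\ell'_{im}:=\ell_{im}-\delta_i$ by the same amount $\delta_i$, then the
left-hand side of (21) increases. The maximal possible
$\delta_i:=\min\{\ell_{is},\ell_{im}\}$ makes $\ell'_{is}$ or $\ell'_{im}$ zero, hence
$0\le\ell'_{is}+\ell'_{im}\le 1$. Similarly

    $0\le\sqrt{\sum_i z_i\ell_{im}}-\sqrt{\sum_i z_i\ell_{is}}\le\sqrt{\sum_i z_i\ell'_{im}}-\sqrt{\sum_i z_i\ell'_{is}}$

This implies

    $0\le\sqrt{\sum_i y_i\ell_{is}}-\sqrt{\sum_i y_i\ell_{im}}$
    $\le\sqrt{\sum_i y_i\ell'_{is}}-\sqrt{\sum_i y_i\ell'_{im}}+\sqrt{\sum_i z_i\ell'_{im}}-\sqrt{\sum_i z_i\ell'_{is}}$
    $\le\sqrt{\sum_i\ell'_{is}(\sqrt{y_i}-\sqrt{z_i})^2}+\sqrt{\sum_i\ell'_{im}(\sqrt{y_i}-\sqrt{z_i})^2}$
    $\le\sqrt{2\sum_i(\ell'_{is}+\ell'_{im})(\sqrt{y_i}-\sqrt{z_i})^2}\le\sqrt{2\sum_i(\sqrt{y_i}-\sqrt{z_i})^2}=\sqrt{2h_t}$

In the third inequality I used the Hellinger bound (19) twice, and in the fourth
inequality I used $\sqrt a+\sqrt b\le\sqrt{2(a+b)}$. Without the reduction
$\ell\to\ell'$ the bound would have been a factor of $\sqrt 2$ worse. Taking the square,
expectation, and sum over $t$ proves the last inequality in (8). The first inequality in
(8) is again an instantiation of (19) with $i\to(t,\omega_{<t})$ and
$y_i\to\mu(\omega_{<t})$, i.e. $\sum_i y_i[...]\to\sum_t E[...]$,
$a_i\to l_t^{\Lambda_\xi}$ and $b_i\to l_t^{\Lambda_\mu}$. ∎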

Proof of equation (11f). Function $P(y):=\sum_{x:f(x)=y}2^{-K(x)}$ is lower
semi-computable, since $K(x)$ is upper semi-computable, all $x\in\mathcal X^*$ can be
enumerated, and $f(x)=y$ is decidable. Further, $\sum_y P(y)=\sum_x 2^{-K(x)}\le 1$, hence
MDL bound (11e) implies $K(y)\stackrel{+}{\le}-\log_2P(y)+K(P)$. Let
$g(y)=\min\{x:f(x)=y\}$ be the lexicographically first inverse of $f$. With
$K(P)\stackrel{+}{\le}K(f)=O(1)$, also function $g$ has complexity $O(1)$. Hence

    $2^{-K(y)}\stackrel{\times}{\ge}P(y)=\sum_{x:f(x)=y}2^{-K(x)}\ge 2^{-K(g(y))}\stackrel{\times}{\ge}2^{-K(y)}$

where I dropped all but the contribution from $g(y)$ in the sum, and used (11d) for
$g$. ∎

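Proof of bound $M(\bar x_n|x_{<n}) \stackrel{\times}{\ge} 2^{-K(n)}$. For $x = x_{<n} \in \mathcal X^{n-1}$ and $a = \bar x_n \in \mathcal X$
we have

$$M(a|x) \stackrel{(a)}{=} \frac{M(xa)}{M(x)} \stackrel{(b)}{=} \frac{\sum_{p:U(p)=xa*} 2^{-\ell(p)}}{\sum_{\tilde p:U(\tilde p)=x*} 2^{-\ell(\tilde p)}} \stackrel{(c)}{\ge} \frac{\sum_{\tilde p:U(\tilde p)=x*} 2^{-\ell(qn^*\tilde p)}}{\sum_{\tilde p:U(\tilde p)=x*} 2^{-\ell(\tilde p)}} \stackrel{(d)}{=} 2^{-\ell(qn^*)} \stackrel{(e)}{\stackrel{\times}{=}} 2^{-K(n)}.$$

In (a) and (b) I simply inserted the definition of $M$. I now (c) restrict the
sum over all $p : U(p) = xa*$ in the numerator to programs $p$ of the following form:
$p = qn^*\tilde p$, where $U(\tilde p) = x*$, $n^*$ is the shortest code of $n$, and $q$ simulates $\tilde p$ until
$n-1$ symbols are printed, then prints $a$, and thereafter halts, i.e. $U(p) = xa$. The
numerator now sums over exactly the same programs $\tilde p$ as the denominator. Since
$2^{-\ell(p)} = 2^{-\ell(qn^*)}2^{-\ell(\tilde p)}$, and $2^{-\ell(qn^*)}$ is a constant independent of $\tilde p$, numerator and
denominator cancel and (d) follows. (e) follows from the definition of $n^*$ and from
$\ell(q) = O(1)$. ■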

Proof of bound (17) $M(\bar x_n|x_{<n}) \stackrel{\times}{\le} 2^{2Km(x_{1:n})-K(n)}$. Assume $x_{1:\infty}$ is a computable
sequence, $\mathcal X$ is binary, and $\bar x_n \ne x_n$, and define $P(n) := M(x_{<n}\bar x_n)$.
Given $x_{1:\infty}$, $P$ can be semi-computed from below, hence $K(P) \stackrel{+}{\le} Km(x_{1:\infty})$.
Also $\sum_n P(n) \le 1$, since $\{x_{<n}\bar x_n : n \in \mathbb N\}$ forms a prefix-free set. Hence
$K(n) \stackrel{+}{\le} -\log_2 P(n) + K(P)$ by (11b), which implies $M(x_{<n}\bar x_n) \stackrel{\times}{\le} 2^{Km(x_{1:\infty})-K(n)}$.
Since $M(x_{<n}) \ge 2^{-Km(x_{<n})} \ge 2^{-Km(x_{1:\infty})}$, we get $M(\bar x_n|x_{<n}) \stackrel{\times}{\le} 2^{2Km(x_{1:\infty})-K(n)}$, which
nearly is (17). Since the l.h.s. is independent of $x_{n+1:\infty}$, a bound independent of it
should be (and is) possible, as we will now show.

Consider sequence $x_{1:n}$ and shortest program $p$ printing $x_{1:n}*$. Let
$U_t$ be $U$ stopped after $t$ time steps and define corresponding $M_t$. Then
$U_t(p) = x_{1:n_t}$ (for some $x_{n+1:n_t}$ if $n_t > n$). I define $P_t(n') := \sum_{a \ne x_{n'}} M_t(x_{<n'}a)$
for $n' \le n_t$ and $0$ for $n' > n_t$. With $n_t$ also $P_t$ is computable and increasing,
hence $P(n') := \lim_{t\to\infty} P_t(n') = \sup_t P_t(n')$ is lower semi-computable. Clearly
$P(n') = \sum_{a \ne x_{n'}} M(x_{<n'}a)$ for $n' \le n_\infty$ and $P(n') = 0$ for $n' > n_\infty$ ($n_\infty := \lim_t n_t \le \infty$).
Hence $\sum_{n'} P(n') \le 1$ since $\{x_{<n'}a : a \ne x_{n'}, n' \le n_\infty\}$ is a prefix-free set, which implies
$K(n) \stackrel{+}{\le} -\log_2 P(n) + K(P)$ by (11b). Since $n \le n_\infty$ and $K(P) \stackrel{+}{\le} \ell(p) = Km(x_{1:n})$, we
get $\sum_{a \ne x_n} M(x_{<n}a) \stackrel{\times}{\le} 2^{Km(x_{1:n})-K(n)}$. Using $M(x_{<n}) \ge 2^{-Km(x_{<n})} \ge 2^{-Km(x_{1:n})}$, we get
the desired bound $M(\bar x_n|x_{<n}) \le \sum_{a \ne x_n} M(a|x_{<n}) \stackrel{\times}{\le} 2^{2Km(x_{1:n})-K(n)}$. ■
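Both parts of the proof above rest on the fact that the off-sequence strings $x_{<n}\bar x_n$ form a prefix-free set, so that their total (semi)measure is at most 1. Since $M$ is not computable, the following Python sketch uses a Bernoulli measure as an assumed stand-in for $M$ and binary strings of finite length; the function names and the parameter `theta` are hypothetical and serve illustration only.

```python
def flipped_continuations(x):
    """Strings x_<n followed by the flipped n-th bit, for n = 1..len(x)."""
    return [x[:n] + ("1" if x[n] == "0" else "0") for n in range(len(x))]


def is_prefix_free(strings):
    """True if no string in the collection is a proper prefix of another."""
    return not any(s != t and t.startswith(s) for s in strings for t in strings)


def bernoulli(x, theta=0.3):
    """Probability that an i.i.d. Bernoulli(theta) source starts with the bits x."""
    ones = x.count("1")
    return theta ** ones * (1 - theta) ** (len(x) - ones)


x = "0110100"
off_sequence = flipped_continuations(x)
total = sum(bernoulli(s) for s in off_sequence)
```

For a proper measure the cylinder sets of these strings together with that of `x` partition the sequence space, so `total` equals `1 - bernoulli(x)`; for a semimeasure such as $M$ only the inequality $\le 1$ survives, which is all the proof needs.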